COVARIANCE ESTIMATION FOR DISTRIBUTIONS WITH $2 + \varepsilon$ MOMENTS



NIKHIL SRIVASTAVA AND ROMAN VERSHYNIN 

Abstract. We study the minimal sample size $N = N(n)$ that suffices to estimate the
covariance matrix of an $n$-dimensional distribution by the sample covariance matrix in the
operator norm, with an arbitrary fixed accuracy. We establish the optimal bound $N = O(n)$
for every distribution whose $k$-dimensional marginals have uniformly bounded $2 + \varepsilon$ moments
outside the sphere of radius $O(\sqrt{k})$. In the specific case of log-concave distributions, this
result provides an alternative approach to the Kannan-Lovász-Simonovits problem, which
was recently solved by Adamczak, Litvak, Pajor and Tomczak-Jaegermann [1]. Moreover,
a lower estimate on the covariance matrix holds under a weaker assumption: uniformly
bounded $2 + \varepsilon$ moments of one-dimensional marginals. Our argument consists of randomizing
the spectral sparsifier, a deterministic tool developed recently by Batson, Spielman and
Srivastava [4]. The new randomized method allows one to control the spectral edges of the
sample covariance matrix via the Stieltjes transform evaluated at carefully chosen random
points.



1. Introduction 

1.1. Covariance estimation problem. Estimating covariance matrices of high-dimensional
distributions is a basic problem in statistics and its numerous applications. Consider
a random vector $X$ valued in $\mathbb{R}^n$ and let us assume for simplicity that $X$ is centered, i.e.
$\mathbb{E}X = 0$; this restriction will not be needed later. The covariance matrix of $X$ is the $n \times n$
positive semidefinite matrix

$\Sigma = \mathbb{E}XX^T.$

Our goal is to estimate $\Sigma$ from a sample $X_1, \ldots, X_N$ taken from the same distribution as $X$.
A classical unbiased estimator for $\Sigma$ is the sample covariance matrix

$\Sigma_N = \frac{1}{N} \sum_{i=1}^N X_i X_i^T.$

A basic question is to determine the minimal sample size $N$ which guarantees that $\Sigma$ is
accurately estimated by $\Sigma_N$. More precisely, for a given accuracy $\varepsilon > 0$ we are interested in
the minimal $N = N(n, \varepsilon)$ so that

$\mathbb{E}\|\Sigma_N - \Sigma\| \le \varepsilon \|\Sigma\|$

where $\|\cdot\|$ denotes the spectral (operator) norm. Replacing $X$ by $\Sigma^{-1/2}X$ and $X_i$ by $\Sigma^{-1/2}X_i$
we reduce the problem to the distributions for which $\Sigma = I$, i.e. to isotropic distributions.
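
As a numerical illustration of the quantities just introduced (not part of the original argument), the following minimal sketch computes the sample covariance matrix and the relative operator-norm error in pure Python. The function names and the use of power iteration to approximate the spectral norm are assumptions made for this example only.

```python
import math

def sample_covariance(samples):
    """Return (1/N) * sum_i x_i x_i^T for a list of N vectors (nested lists)."""
    n = len(samples[0])
    N = len(samples)
    cov = [[0.0] * n for _ in range(n)]
    for x in samples:
        for a in range(n):
            for b in range(n):
                cov[a][b] += x[a] * x[b] / N
    return cov

def operator_norm_symmetric(M, iterations=200):
    """Approximate the spectral norm of a symmetric matrix M.

    Power iteration is run on M^2 so that eigenvalues of either sign are handled;
    the fixed starting vector is an arbitrary (hypothetical) choice.
    """
    n = len(M)

    def matvec(w):
        return [sum(M[a][b] * w[b] for b in range(n)) for a in range(n)]

    v = [1.0 / (k + 1) for k in range(n)]
    estimate = 0.0
    for _ in range(iterations):
        length = math.sqrt(sum(c * c for c in v))
        if length == 0.0:
            break
        v = [c / length for c in v]
        Mv = matvec(v)
        estimate = math.sqrt(sum(c * c for c in Mv))  # ||M v|| for a unit vector v
        v = matvec(Mv)
    return estimate

def estimation_error(samples, cov_true):
    """Relative error ||Sigma_N - Sigma|| / ||Sigma|| in the operator norm."""
    cov = sample_covariance(samples)
    n = len(cov)
    diff = [[cov[a][b] - cov_true[a][b] for b in range(n)] for a in range(n)]
    return operator_norm_symmetric(diff) / operator_norm_symmetric(cov_true)
```

For instance, the four points $\pm\sqrt{2}\,e_1, \pm\sqrt{2}\,e_2$ in the plane, each taken once, have sample covariance equal to the identity, so the error is zero up to rounding.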

1.2. Sampling from isotropic distributions. We consider independent isotropic random
vectors $X_i$ valued in $\mathbb{R}^n$, i.e. such that $\mathbb{E}X_iX_i^T = I$. Our goal is to determine the minimal
sample size $N = N(n, \varepsilon)$ such that

$\mathbb{E}\|\Sigma_N - I\| \le \varepsilon.$

For obvious dimension reasons, one must have $N \ge n$. A remarkably general result of
M. Rudelson ([12], see [16, Section 4.3]) yields that if $\|X\|_2 = O(\sqrt{n})$ almost surely, then

(1.1)  $N = O(n \log n)$

where the $O(\cdot)$ notation hides the dependence on $\varepsilon$ here and thereafter. It is well known that
the logarithmic oversampling factor cannot be removed from (1.1) in general, for example if
the distribution is supported on $O(n)$ points; see Section 1.8.

Nevertheless, it is also known that for sufficiently regular distributions the logarithmic
oversampling factor is not needed in (1.1). This is a property of the standard normal
distribution in $\mathbb{R}^n$ and, more generally, of the distributions with sub-gaussian one-dimensional
marginals. Namely,

$N = O(n)$

holds for every distribution that satisfies

(1.2)  $\sup_{\|x\|_2 \le 1} (\mathbb{E}|\langle X, x\rangle|^p)^{1/p} = O(\sqrt{p})$  for $p \ge 1$.

This result can be obtained by a standard covering argument, see [16, Section 4.3].

It is an open problem to describe the distributions for which the logarithmic oversampling
is not needed, i.e. for which $N = O(n)$. The gap between sub-gaussian distributions where
this bound holds and discrete distributions on $O(n)$ points where it fails is quite large.

It is already a difficult problem to relax the sub-gaussian moment assumption (1.2) to
anything weaker while keeping $N = O(n)$. A major step was made by R. Adamczak,
A. Litvak, A. Pajor and N. Tomczak-Jaegermann [1], who showed that $N = O(n)$ still holds
(in fact, with high probability) under the sub-exponential moment assumptions:

(1.3)  $\|X\|_2 = O(\sqrt{n})$ a.s.,  $\sup_{\|x\|_2 \le 1} (\mathbb{E}|\langle X, x\rangle|^p)^{1/p} = O(p)$  for $p \ge 1$.

As an application, it was shown in [1] that $N = O(n)$ holds for log-concave distributions, and
in particular for the uniform distributions on isotropic convex bodies in $\mathbb{R}^n$. This answered
a question posed by R. Kannan, L. Lovász and M. Simonovits in [9].

The second author of the present paper speculated in [15] that $N = O(n)$ should hold for
a much wider class of distributions than sub-exponential, perhaps for all distributions with
$2 + \varepsilon$ moments. (The second moment, i.e. the variance, is assumed to be finite by the nature of
the problem, as otherwise the covariance matrix is not defined.) The goal of the current
paper is to provide a result of this type.

Theorem 1.1. Consider independent isotropic random vectors $X_i$ valued in $\mathbb{R}^n$. Assume
that $X_i$ satisfy the strong regularity assumption: for some $C, \eta > 0$, one has

(SR)  $\mathbb{P}\{\|PX_i\|_2^2 > t\} \le C t^{-1-\eta}$  for $t > C\,\mathrm{rank}(P)$

for every orthogonal projection $P$ in $\mathbb{R}^n$. Then, for $\varepsilon \in (0,1)$ and for

$N \ge C_{\mathrm{main}}\, \varepsilon^{-2-2/\eta} \cdot n$

one has

(1.4)  $\mathbb{E}\Big\| \frac{1}{N} \sum_{i=1}^N X_i X_i^T - I \Big\| \le \varepsilon.$

Here $C_{\mathrm{main}} = 512\,(48C)^{2+2/\eta}(6 + 6/\eta)^{1+4/\eta}$, and as before $\|\cdot\|$ denotes the spectral (operator)
matrix norm and $\|\cdot\|_2$ denotes the Euclidean norm in $\mathbb{R}^n$.

Remark. Since the distribution of $PX_i$ is isotropic in the range of $P$, we have $\mathbb{E}\|PX_i\|_2^2 =
\mathrm{rank}(P)$. This explains why (SR) concerns only the tail values of $t$ which are above $\mathrm{rank}(P)$.
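
The identity in this remark is a one-line trace computation. A sketch, using only the isotropy $\mathbb{E}X_iX_i^T = I$ and the projection property $P = P^T = P^2$:

```latex
\mathbb{E}\,\|PX_i\|_2^2
  = \mathbb{E}\,\operatorname{tr}\!\big(P X_i X_i^T P\big)
  = \operatorname{tr}\!\big(P\,\mathbb{E}[X_i X_i^T]\,P\big)
  = \operatorname{tr}(P^2)
  = \operatorname{tr}(P)
  = \operatorname{rank}(P).
```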

1.3. Covariance estimation. Returning to the covariance estimation problem, we deduce 
the following. 

Corollary 1.2 (Covariance estimation). Consider a random vector $X$ valued in $\mathbb{R}^n$ with
covariance matrix $\Sigma$. Assume that for some $C, \eta > 0$, the isotropic random vector $Z =
\Sigma^{-1/2}X$ satisfies

(SR)  $\mathbb{P}\{\|PZ\|_2^2 > t\} \le C t^{-1-\eta}$  for $t > C\,\mathrm{rank}(P)$

for every orthogonal projection $P$ in $\mathbb{R}^n$. Then, for every $\varepsilon \in (0, 1)$ and

$N \ge C_{\mathrm{main}}\, \varepsilon^{-2-2/\eta} \cdot n$

the sample covariance matrix $\Sigma_N$ obtained from $N$ independent copies of $X$ satisfies

$\mathbb{E}\|\Sigma_N - \Sigma\| \le \varepsilon \|\Sigma\|.$

This result follows by applying Theorem 1.1 for the independent copies of the random
vectors $Z_i = \Sigma^{-1/2}X_i$ instead of $X_i$, and by multiplying the matrix $\frac{1}{N}\sum_{i=1}^N Z_iZ_i^T - I$ in
(1.4) by $\Sigma^{1/2}$ on the left and on the right. Thus, for distributions satisfying (SR) we conclude
that the minimal sample size for the covariance estimation is $N = O(n)$.
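
The deduction just described can be written out as follows. This is a sketch in which $\Sigma$ is assumed invertible, so that $Z_i = \Sigma^{-1/2}X_i$ is well defined:

```latex
\Sigma_N - \Sigma
  = \Sigma^{1/2}\Big(\frac{1}{N}\sum_{i=1}^N Z_i Z_i^T - I\Big)\Sigma^{1/2},
\qquad\text{hence}\qquad
\|\Sigma_N - \Sigma\|
  \le \|\Sigma^{1/2}\|^2 \,\Big\|\frac{1}{N}\sum_{i=1}^N Z_i Z_i^T - I\Big\|
  = \|\Sigma\| \,\Big\|\frac{1}{N}\sum_{i=1}^N Z_i Z_i^T - I\Big\|.
```

Taking expectations and applying Theorem 1.1 to the isotropic vectors $Z_i$ bounds the right-hand side by $\varepsilon\|\Sigma\|$.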

Let us illustrate these results with two important examples. 

1.4. Sampling from log-concave distributions and convex sets. A notable class of
examples where Corollary 1.2 applies is formed by the log-concave distributions, which
includes the uniform distributions on convex bodies. Consider a random vector $X$ with a
log-concave distribution in $\mathbb{R}^n$, i.e. whose density has the form $e^{-V(x)}$ where $V(x)$ is a
convex function on $\mathbb{R}^n$. A concentration inequality of G. Paouris [11] implies that regularity
assumption (SR) holds for $X$. Indeed, consider an orthogonal projection $P$ in $\mathbb{R}^n$ and let
$k = \mathrm{rank}(P)$. The distribution of the isotropic random vector $Z = \Sigma^{-1/2}X$ is log-concave in
$\mathbb{R}^n$, and so is the distribution of $PZ$ in the $k$-dimensional space $\mathrm{range}(P)$. The theorem of
G. Paouris then states that

$\mathbb{P}\{\|PZ\|_2^2 > t\} \le \exp(-c\sqrt{t})$  for $t \ge Ck$

where $C, c > 0$ are absolute constants. This is obviously stronger than assumption (SR), so
Corollary 1.2 applies.

We conclude that the minimal sample size for estimating the covariance matrix of a
log-concave distribution is $N = O(n)$. This matches the bound obtained by R. Adamczak et al.
[1], though it should be noted that the guarantee of [1] holds with probability that converges
to 1 exponentially fast as $n \to \infty$, whereas ours holds only in expectation. We have not tried
to obtain probability bounds of this type; note however that under our general assumption
(SR), the probability cannot converge to 1 faster than at a polynomial rate in $n$.

1.5. Sampling from product distributions. A distribution does not have to be
log-concave in order to satisfy the regularity assumptions in Theorem 1.1 and Corollary 1.2. For
example, all product distributions with finite $4 + \varepsilon$ moments have the required regularity
property. We can deduce this from the following thin shell estimate:

Proposition 1.3 (Thin shell probability for product distributions). Let $p > 2$, and consider
a random vector $X = (\xi_1, \ldots, \xi_n)$, where $\xi_i$ are independent random variables with zero
means, unit variances and with uniformly bounded $(2p)$-th moments. Then for every $1 \le
k \le n$ and for every orthogonal projection $P$ in $\mathbb{R}^n$ with $\mathrm{rank}\,P = k$, one has

(1.5)  $\mathbb{E}\big|\|PX\|_2^2 - k\big|^p \lesssim k^{p/2}.$

The factor implicit in (1.5) depends only on $p$ and on the bound on the $(2p)$-th moments.
The proof of Proposition 1.3 is given in the Appendix.

Applying Chebyshev's inequality together with (1.5) we obtain for $t > k$ that

$\mathbb{P}\{\|PX\|_2^2 > k + t\} \le t^{-p}\, \mathbb{E}\big|\|PX\|_2^2 - k\big|^p \lesssim t^{-p} k^{p/2} \le t^{-p/2}.$

Thus for $p > 2$ we get a sub-linear tail, as required in the regularity assumption (SR).
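
To match this tail with the form of (SR) explicitly, one can set $\eta := p/2 - 1 > 0$ and $s := k + t$; both symbols are introduced here only for illustration. For $s > 2k$ one has $t = s - k \ge s/2$, so the bound above gives:

```latex
\mathbb{P}\big\{\|PX\|_2^2 > s\big\}
  \lesssim (s-k)^{-p/2}
  \le (s/2)^{-p/2}
  = 2^{1+\eta}\, s^{-1-\eta},
\qquad s > 2k = 2\,\operatorname{rank}(P).
```

This is a tail of the type required in (SR), with a constant that depends only on $p$ and on the moment bound.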

This shows that Theorem 1.1 applies for product distributions in $\mathbb{R}^n$ with uniformly bounded
$4 + \varepsilon$ moments, and it gives $N = O(n)$ for their covariance estimation. Note that this
moment assumption is almost tight: according to [5], if the components are i.i.d. and have
infinite fourth moment, then $\limsup \|\Sigma_N\| = \infty$ as $n \to \infty$ and $n/N \to y > 0$. (This is
because in this situation at least one of the $Nn$ i.i.d. coordinates of $X_1, \ldots, X_N$ is likely
to be large.)

1.6. Extreme eigenvalues. Theorem 1.1 states that, for sufficiently large $N$, all eigenvalues
of the sample covariance matrix $\Sigma_N = \frac{1}{N}\sum_{i=1}^N X_iX_i^T$ are concentrated near 1. It is easy to
extend this to a result that holds for all $N$, as follows.

Corollary 1.4. Let $n, N$ be arbitrary positive integers, suppose $X_i$ are independent isotropic
random vectors in $\mathbb{R}^n$ satisfying (SR), and let $y = n/N$. Then the sample covariance matrix
$\Sigma_N = \frac{1}{N}\sum_{i=1}^N X_iX_i^T$ satisfies:

(1.6)  $1 - C_1 y^c \le \mathbb{E}\lambda_{\min}(\Sigma_N) \le \mathbb{E}\lambda_{\max}(\Sigma_N) \le 1 + C_1(y + y^c).$

Here $c = \eta/(2\eta+2)$, $C_1 = 512\,(16C)^{1+2/\eta}(6 + 6/\eta)^{1+4/\eta}$, and $\lambda_{\min}(\Sigma_N)$, $\lambda_{\max}(\Sigma_N)$ denote the
smallest and the largest eigenvalues of $\Sigma_N$ respectively.

We deduce this result in Section 3. One can view (1.6) as a non-asymptotic form of
the Bai-Yin law for the extreme eigenvalues of sample covariance matrices [3]. This law,
associated with the work of S. Geman, Z. Bai, Y. Yin, P. Krishnaiah and J. Silverstein,
applies for product distributions, specifically for random vectors $X = (\xi_1, \ldots, \xi_n)$ with i.i.d.
components $\xi_j$ with zero mean, unit variance and finite fourth moment. For such distributions
one has asymptotically almost surely that

(1.7)  $(1 - \sqrt{y})^2 - o(1) \le \lambda_{\min}(\Sigma_N) \le \lambda_{\max}(\Sigma_N) \le (1 + \sqrt{y})^2 + o(1)$

as $n \to \infty$ and $n/N \to y \in [0,1)$, see the rigorous statement in [3]. This limit law is
sharp. On the other hand, the inequalities (1.6) hold in any fixed dimensions $N, n$ and for
general distributions (as in Theorem 1.1), without any independence requirements for the
coordinates.

Remark. Comparing (1.6) with (1.7) one can ask about the optimal value of the exponent
$c$, in particular whether $c = 1/2$. In a recent paper [2], R. Adamczak et al. obtained
the optimal exponent $c = 1/2$ for log-concave distributions, and more generally for
sub-exponential distributions in the sense of (1.3). As (1.3) implies (SR) with $\eta = (p - 1)/2$ and
$C \le (O(p))^p$, Theorem 1.1 recovers a bound of $c = 1/2 - 1/(p + 1) = 1/2 - o(1)$ as $p \to \infty$.
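
The arithmetic behind the exponent in this remark can be checked exactly. The sketch below assumes that the exponent in (1.6) is $c = \eta/(2\eta + 2)$, which is the value obtained by balancing the two error terms in the proof of Corollary 1.4 in Section 3; the function names are hypothetical.

```python
from fractions import Fraction

def exponent_c(eta):
    """Exponent c = eta / (2*eta + 2), assumed here to be the exponent in (1.6)."""
    eta = Fraction(eta)
    return eta / (2 * eta + 2)

def eta_from_p(p):
    """Regularity parameter eta = (p - 1)/2 obtained from p-th moments, as in the remark."""
    return Fraction(p - 1, 2)

# For every integer p >= 2 the exponent equals 1/2 - 1/(p + 1), which tends to 1/2.
for p in range(2, 50):
    assert exponent_c(eta_from_p(p)) == Fraction(1, 2) - Fraction(1, p + 1)
```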

Remark (Random matrices with independent rows). Corollary 1.4 can be interpreted as a
result about the spectrum of random matrices with independent rows. Indeed, if $A$ is the
$N \times n$ matrix with rows $X_i$ then $\Sigma_N = \frac{1}{N}\sum_{i=1}^N X_iX_i^T = \frac{1}{N}A^TA$. So the squares of the singular
values of the matrix $N^{-1/2}A$ are the eigenvalues of the matrix $\Sigma_N$, and they are controlled as
in (1.6). In particular, under the regularity assumption (SR) on $X_i$ we obtain that

$(\mathbb{E}\|A\|^2)^{1/2} \le C_2(\sqrt{N} + \sqrt{n})$

where $C_2 = \sqrt{2C_1}$ and $C_1$ is as in Corollary 1.4.

Notice that while the rows of the matrix $A$ are independent, the columns of $A$ may be
dependent. The simpler case where all entries of $A$ are independent is well understood by now. In
the latter case, if the entries have zero mean and uniformly bounded fourth moments, the
bound $\mathbb{E}\|A\| \lesssim \sqrt{N} + \sqrt{n}$ follows, for example, from a general inequality of R. Latała [10].

1.7. Smallest eigenvalue. Our proof of Theorem 1.1 consists of two separate arguments
for upper and lower bounds for the spectrum of the sample covariance matrix. It turns
out that the full power of the strong regularity assumption (SR) is not needed for the lower
bound. It suffices to assume $2 + \eta$ moments for one-dimensional marginals rather than for
marginals in all dimensions. This is only slightly stronger than the isotropy assumption,
which fixes the second moments of one-dimensional marginals, and it broadens the class of
distributions for which the result applies. We state this as a separate theorem.

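Theorem 1.5 (Smallest eigenvalue). Consider independent isotropic random vectors $X_i$
valued in $\mathbb{R}^n$. Assume that $X_i$ satisfy the weak regularity assumption: for some $C, \eta > 0$,

(WR)  $\sup_{\|x\|_2 \le 1} \mathbb{E}|\langle X_i, x\rangle|^{2+\eta} \le C.$

Then, for $\varepsilon > 0$ and for

(1.8)  $N \ge C_{\mathrm{lower}}\, \varepsilon^{-2-2/\eta} \cdot n,$

the minimum eigenvalue of the sample covariance matrix $\Sigma_N = \frac{1}{N}\sum_{i=1}^N X_iX_i^T$ satisfies

$\mathbb{E}\lambda_{\min}(\Sigma_N) \ge 1 - \varepsilon.$

Here $C_{\mathrm{lower}} = 40\,(10C)^{2/\eta}$.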

Remark (Moments vs. Tails). We have chosen to write (WR) in terms of moments rather than
in terms of tail bounds as in (SR). By integration of the tails one can check that, for any
given $\eta > 0$, (SR) with parameter $C$ implies (WR) with parameter $C' = C(2 + 2/\eta)$.
In the remainder of the paper we will use (WR) for theorems regarding only the smallest
eigenvalue and (SR) for theorems which involve the largest one.

Remark (Product distributions with $2 + \eta$ moments). Many distributions of interest satisfy
(WR). For example, let $X = (\xi_1, \ldots, \xi_n)$ have i.i.d. components with zero mean, unit
variance and finite $(2 + \eta)$ moment. Then a standard application of symmetrization and
Khintchine's inequality (or a direct application of Rosenthal's inequality [13], see [8]) shows
that one-dimensional marginals of $X$ also have bounded $(2 + \eta)$ moments, i.e. (WR) holds.

In the context of the Bai-Yin law discussed in Section 1.6, this indicates that the smallest
eigenvalue of a random matrix can be approximately controlled (as in (1.6)) even if the fourth
moment is infinite. However, as we already recalled, four moments are necessary to control
the largest eigenvalue in the classical Bai-Yin law [5].

Remark (Covariance estimation). Theorem 1.5 can be used to obtain a lower estimate for
the covariance matrix under the weak regularity assumption (WR).

1.8. Optimality of the regularity assumptions. Let us briefly mention two simple and
known examples that illustrate the role of regularity assumptions (SR) and (WR) in the control
of the largest and smallest eigenvalues respectively.

For the largest eigenvalue as in Theorem 1.1 it is not sufficient to put a regularity
assumption of the type (SR) only on one-dimensional marginals, as it is done in Theorem 1.5 for
the smallest eigenvalue. Even the following very strong (exponential) moment assumption
is insufficient:

(1.9)  $\sup_{\|x\|_2 \le 1} \mathbb{P}\{|\langle X, x\rangle| > t\} \le C \exp(-ct)$  for $t > 0$.

Indeed, consider a random vector $X = \xi Z$ where $Z$ is a random vector uniformly distributed
in the Euclidean sphere in $\mathbb{R}^n$ centered at the origin and with radius $\sqrt{n}$, and where $\xi$ is a
standard normal random variable. Then $X$ is isotropic, and all one-dimensional marginals
of $X$ have exponential tail decay (1.9). However, the multiplier $\xi$ produces a dimension-free
tail decay of the norm of $X$, namely $\mathbb{P}\{\|X\|_2 > t\sqrt{n}\} = \mathbb{P}\{|\xi| > t\} \ge \exp(-C't^2)$
for $t > 0$. It follows that a sample of $N = O(n)$ independent copies $X_1, \ldots, X_N$ of $X$ satisfies
$\mathbb{E}\max_{i \le N} \|X_i\|_2^2 \gtrsim n \log N \gtrsim N \log N$, so the matrix $\Sigma_N = \frac{1}{N}\sum_{i=1}^N X_iX_i^T$ satisfies

$\mathbb{E}\|\Sigma_N - I\| \ge N^{-1}\, \mathbb{E}\max_{i \le N} \|X_i\|_2^2 - 1 \gtrsim \log N,$

which contradicts the conclusion of Theorem 1.1. This example is essentially due to G. Aubrun,
see [1, Remark 4.9].
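
The two elementary facts used in this example can be spelled out as follows; this is a sketch of the reasoning, with $j$ denoting an arbitrary index in the sample:

```latex
% The norm of X is governed by the scalar multiplier alone:
\|X\|_2 = |\xi|\,\|Z\|_2 = |\xi|\sqrt{n}.

% Testing \Sigma_N against the unit vector x = X_j / \|X_j\|_2 gives
\|\Sigma_N\|
  \ge x^T \Sigma_N x
  = \frac{1}{N}\sum_{i=1}^N \langle X_i, x\rangle^2
  \ge \frac{1}{N}\,\langle X_j, x\rangle^2
  = \frac{\|X_j\|_2^2}{N},
\qquad\text{so}\qquad
\|\Sigma_N - I\| \ge \|\Sigma_N\| - 1 \ge \frac{1}{N}\max_{i \le N}\|X_i\|_2^2 - 1.
```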

Remark. It is not clear whether Theorem 1.1 would hold if, in addition to $(2 + \eta)$ moments
on one-dimensional marginals, one puts a total boundedness assumption

$\|X\|_2 = O(\sqrt{n})$ almost surely.

A conjecture of this type is discussed in [15], where a version of Theorem 1.1 is proved under
this assumption, with $\eta = 2$ but with an additional $(\log\log n)^{O(1)}$ oversampling factor.

Furthermore, we note that for the smallest eigenvalue as in Theorem 1.5 one cannot drop
the regularity assumption (WR), i.e. the assumption with $\eta = 0$ is not sufficient. This is seen
for $X_i$ uniformly distributed in the set of $2n$ points $(\pm\sqrt{n}\,e_k)$ where $(e_k)_{k=1}^n$ is an orthonormal
basis in $\mathbb{R}^n$. Indeed, in order that the smallest eigenvalue of the matrix $\Sigma_N = \frac{1}{N}\sum_{i=1}^N X_iX_i^T$
be different from zero, one needs $\Sigma_N$ to have full rank, for which all $n$ basis vectors need to
be present in the sample $X_1, \ldots, X_N$. By the coupon collector's problem, for this to happen
with constant probability one needs a sample of size $N \gtrsim n \log n$. For $N = o(n \log n)$, the
smallest eigenvalue is zero with high probability, so the conclusion of Theorem 1.5 fails.
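
The coupon collector example is easy to simulate. In the sketch below (an illustration, not part of the original argument) each sample is $\pm\sqrt{n}\,e_k$ for a uniformly chosen index $k$, so $\Sigma_N$ is diagonal with entries $n \cdot \mathrm{count}_k / N$ and its smallest eigenvalue vanishes exactly when some index is missing. The scaling by $\sqrt{n}$ and the function names are assumptions of this example.

```python
import random

def min_eigenvalue_coordinate_sample(indices, n):
    """Smallest eigenvalue of Sigma_N when X_i = +/- sqrt(n) * e_{indices[i]}.

    Here Sigma_N = diag(n * count_k / N), where count_k is the number of times
    index k appears; the signs do not matter because X X^T ignores them.
    """
    N = len(indices)
    counts = [0] * n
    for k in indices:
        counts[k] += 1
    return n * min(counts) / N

def draws_until_all_seen(n, rng):
    """Number of uniform draws from {0, ..., n-1} needed to see every index once."""
    seen = set()
    draws = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        draws += 1
    return draws

if __name__ == "__main__":
    rng = random.Random(0)
    n = 200
    trials = [draws_until_all_seen(n, rng) for _ in range(50)]
    # The average is typically of order n * log(n), well above n.
    print(sum(trials) / len(trials))
```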

1.9. The argument: randomizing the spectral sparsifier. Our proof of Theorem 1.1
consists of randomizing the spectral sparsifier invented by J. Batson, D. Spielman and
N. Srivastava [4] (see [14]). The randomization makes the spectral sparsifier appear naturally in
the context of random matrix theory. The method is based on evaluating the Stieltjes
transform of $\Sigma_N$ while making rank-one updates. However, in contrast to typical methods of
random matrix theory (and to the spectral sparsifier itself), we shall evaluate the Stieltjes
transform at random real points.

Let us illustrate the method by working out a crude upper bound $O(1)$ for the largest
eigenvalue of $\Sigma_N$. Equivalently, we want to show that a general Wishart matrix $A_N := N\Sigma_N =
\sum_{i=1}^N X_iX_i^T$ has all eigenvalues bounded by $O(N)$. We evaluate the Stieltjes transform

(1.10)  $m_{A_N}(u) = \operatorname{tr}(uI - A_N)^{-1} = \sum_{i=1}^n (u - \lambda_i(A_N))^{-1}, \quad u \in \mathbb{R},$

where $\lambda_i(A_N)$ denote the eigenvalues of $A_N$. This function has singularities at the points
$\lambda_i(A_N)$ and it vanishes at infinity. So the largest eigenvalue of $A_N$ is the largest $u$ where
$m_{A_N}(u) = \infty$. However, such $u$ is difficult to compute. So we soften this quantity by
considering the largest number $u_N$ that satisfies

(1.11)  $m_{A_N}(u_N) = \phi$

where $\phi$ is a fixed sensitivity parameter, for example $\phi = 1$.

The soft spectral edge $u_N$ provides an upper bound for the actual spectral edge, $\lambda_{\max}(A_N) \le
u_N$. So our goal is to show that

$\mathbb{E}u_N = O(N).$

This is the same problem as in [4] except the eigenvalues and hence the soft spectral edge
$u_N$ are now random points. The randomized problem is more difficult as we note below.
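
For a matrix with known eigenvalues, the soft spectral edge can be computed numerically. The following minimal sketch uses bisection on $(\lambda_{\max}, \lambda_{\max} + n/\phi]$, where the Stieltjes transform is decreasing; bisection and the function names are choices made for this illustration, not the method of the paper.

```python
def stieltjes(eigenvalues, u):
    """Stieltjes transform m_A(u) = sum_i 1 / (u - lambda_i), for u > lambda_max."""
    return sum(1.0 / (u - lam) for lam in eigenvalues)

def upper_soft_edge(eigenvalues, phi, tol=1e-12):
    """Largest u with m_A(u) = phi, located by bisection to the right of lambda_max."""
    lam_max = max(eigenvalues)
    n = len(eigenvalues)
    lo = lam_max            # m_A(u) tends to +infinity as u decreases to lambda_max
    hi = lam_max + n / phi  # every term is at most phi / n here, so m_A(hi) <= phi
    while hi - lo > tol * max(1.0, abs(hi)):
        mid = 0.5 * (lo + hi)
        if stieltjes(eigenvalues, mid) > phi:
            lo = mid
        else:
            hi = mid
    return hi

if __name__ == "__main__":
    # For the zero matrix in dimension n, m(u) = n / u, so the soft edge is n / phi.
    print(upper_soft_edge([0.0, 0.0, 0.0], 1.0))
```

By construction the returned value lies to the right of the largest eigenvalue, which is the inequality $\lambda_{\max}(A_N) \le u_N$ used in the text.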

As opposed to the largest eigenvalue of $A_N$, the soft spectral edge $u_N$ can be computed
inductively using rank-one updates to the matrix; $u_k$ will move to the right by a random
amount at each step as we replace $A_{k-1}$ by $A_k = A_{k-1} + X_kX_k^T$. Initially, $A_0 = 0$ so $u_0 = n$
(taking $\phi = 1$). It suffices to prove that $u_k$ moves by $O(1)$ on average at each step:

(1.12)  $\mathbb{E}(u_k - u_{k-1}) = O(1).$

Indeed, by summing up we would obtain the desired estimate $\mathbb{E}u_N = n + O(1) \cdot N = O(N)$.

The soft edge can be recomputed at each step because it is determined by the Stieltjes
transform $m_{A_k}(u)$, which in turn can be recomputed using the Sherman-Morrison formula as is
done in [4], which gives for every $u \in \mathbb{R}$ that

(1.13)  $m_{A_k}(u) = m_{A_{k-1}}(u) + \frac{X_k^T(uI - A_{k-1})^{-2}X_k}{1 - X_k^T(uI - A_{k-1})^{-1}X_k}.$

This reduces proving (1.12) to a probabilistic problem, which is essentially governed by the
distribution of the random vector $X_k$.
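
The rank-one update formula for the Stieltjes transform, $m_{A + xx^T}(u) = m_A(u) + x^T(uI - A)^{-2}x \,/\, (1 - x^T(uI - A)^{-1}x)$, can be sanity-checked numerically. The sketch below does this for symmetric $2 \times 2$ matrices with explicit inverses; restricting to dimension 2 and the helper names are assumptions made to keep the example self-contained.

```python
def trace_inverse_2x2(M):
    """tr(M^{-1}) = tr(M) / det(M) for an invertible 2x2 matrix."""
    (a, b), (c, d) = M
    return (a + d) / (a * d - b * c)

def stieltjes_2x2(A, u):
    """m_A(u) = tr (uI - A)^{-1} for a 2x2 matrix A."""
    return trace_inverse_2x2([[u - A[0][0], -A[0][1]], [-A[1][0], u - A[1][1]]])

def rank_one_update(A, x):
    """Return A + x x^T."""
    return [[A[i][j] + x[i] * x[j] for j in range(2)] for i in range(2)]

def updated_stieltjes_via_sherman_morrison(A, x, u):
    """Right-hand side of the update formula, for symmetric A and u outside its spectrum."""
    a, b = u - A[0][0], -A[0][1]
    c, d = -A[1][0], u - A[1][1]
    det = a * d - b * c
    # y = (uI - A)^{-1} x, using the explicit 2x2 inverse
    y = [(d * x[0] - b * x[1]) / det, (-c * x[0] + a * x[1]) / det]
    quad1 = x[0] * y[0] + x[1] * y[1]   # x^T (uI - A)^{-1} x
    quad2 = y[0] * y[0] + y[1] * y[1]   # x^T (uI - A)^{-2} x, since uI - A is symmetric
    return stieltjes_2x2(A, u) + quad2 / (1.0 - quad1)

if __name__ == "__main__":
    A = [[2.0, 1.0], [1.0, 3.0]]
    x = [0.5, -1.0]
    u = 10.0
    print(stieltjes_2x2(rank_one_update(A, x), u))
    print(updated_stieltjes_via_sherman_morrison(A, x, u))
```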

The difficulty is that we are facing a non-linear inverse problem. Indeed, for a fixed $u$ it is
not difficult to compute the expectation of $m_{A_k}(u)$ from (1.13), and in particular to bound
the expectation by $\phi$; this is done in [4]. However, we require the identity $m_{A_k}(u) = \phi$ to
hold deterministically, because the largest $u$ that satisfies it defines the soft spectral edge of
$A_k$ as in (1.11). The task of computing the expectation of a random number $u$ for which
$m_{A_k}(u) = \phi$ is a highly non-linear inverse problem [6, Section 4.1]. This is where some
regularity of $X_k$ with respect to the eigenstructure of $A_{k-1}$ becomes essential. A technical
part of our argument developed in most of the remaining sections is to realize and prove
that a small amount of regularity encoded by (SR) or (WR) is already sufficient to control the
solution to the inverse problem, and ultimately to control the spectral edges of $A_k$.

1.10. Organization of the paper. The rest of the paper is organized as follows. We start
with the somewhat simpler Theorem 1.5 for the smallest eigenvalue in Section 2. A
corresponding result for the largest eigenvalue, Theorem 3.1, is proved in Section 3. Corollary 1.4
is also deduced in Section 3. Combining Theorems 1.5 and 3.1 in Section 4, we obtain the
main Theorem 1.1 on the spectral norm. In the Appendix, we prove Proposition 1.3 on the
regularity of product distributions.

Acknowledgement. The authors are grateful to the referees whose comments improved 
the presentation of the paper. 

2. The Lower Edge 

We begin by proving Theorem 1.5 about the lower edge of the spectrum, which is
slightly simpler and requires fewer assumptions than the upper edge. As in [4], the tool that
we use to do this is the lower Stieltjes transform

$\underline{m}_A(\ell) = \operatorname{tr}(A - \ell I)^{-1} = \sum_{i=1}^n (\lambda_i(A) - \ell)^{-1}, \quad \ell \in \mathbb{R}.$

Note that $\underline{m}_A(\ell) = m_{-A}(-\ell)$ where $m_A$ is the usual Stieltjes transform in (1.10).

For a sensitivity value $\phi > 0$, we define the lower soft spectral edge $\ell_\phi(A)$ to be the smallest
$\ell$ for which

$\underline{m}_A(\ell) = \phi.$

Since $\underline{m}_A(\ell)$ increases from $0$ to $\infty$ as $\ell$ increases from $-\infty$ to the lower spectral edge $\lambda_{\min}(A)$,
the value $\ell_\phi(A)$ is defined uniquely, and we always have the bound

$\ell_\phi(A) \le \lambda_{\min}(A).$

For $\phi \to \infty$ we have $\ell_\phi(A) \to \lambda_{\min}(A)$. However, we will work with small sensitivity
$\phi \in (0, 1)$, which will make the soft spectral edge $\ell_\phi(A)$ softer and easier to control.
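
The dependence of the lower soft edge on the sensitivity can be explored numerically. In this minimal sketch the edge is found by bisection on $[\lambda_{\min} - n/\phi, \lambda_{\min})$, where the lower Stieltjes transform is increasing; the function names and the bisection approach are assumptions of this illustration.

```python
def lower_stieltjes(eigenvalues, ell):
    """Lower Stieltjes transform sum_i 1 / (lambda_i - ell), for ell < lambda_min."""
    return sum(1.0 / (lam - ell) for lam in eigenvalues)

def lower_soft_edge(eigenvalues, phi, tol=1e-12):
    """The point ell < lambda_min at which the lower Stieltjes transform equals phi."""
    lam_min = min(eigenvalues)
    n = len(eigenvalues)
    hi = lam_min            # the transform tends to +infinity as ell increases to lambda_min
    lo = lam_min - n / phi  # every term is at most phi / n here, so the transform is <= phi
    while hi - lo > tol * max(1.0, abs(lo)):
        mid = 0.5 * (lo + hi)
        if lower_stieltjes(eigenvalues, mid) > phi:
            hi = mid
        else:
            lo = mid
    return lo

if __name__ == "__main__":
    eigs = [1.0, 2.0, 5.0]
    for phi in (0.1, 1.0, 10.0):
        # Larger sensitivity moves the soft edge closer to lambda_min = 1.
        print(phi, lower_soft_edge(eigs, phi))
```

For the zero matrix in dimension $n$ the transform is $n/(-\ell)$, so the sketch returns $-n/\phi$, the starting value used in the proof of Theorem 1.5.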

The crucial property of $\ell_\phi(A)$ is that it grows steadily under rank-one updates. Consider
what happens when we add a random rank-one matrix $XX^T$ to $A \succ \ell I$, where $X$ is chosen
from an isotropic distribution on $\mathbb{R}^n$. As $\mathbb{E}\operatorname{tr}(A + XX^T) = \operatorname{tr}(A) + \operatorname{tr}\mathbb{E}XX^T = \operatorname{tr}(A) + n$,
we expect the eigenvalues of $A + XX^T$ to have increased by 1 on average. It turns out that
$\ell_\phi(A)$ behaves almost as nicely as this if the distribution of $X$ is sufficiently regular and the
sensitivity $\phi$ is sufficiently small. This is established in the following theorem.

Theorem 2.1 (Random Lower Shift). Suppose $X$ is an isotropic random vector in $\mathbb{R}^n$
satisfying the weak regularity assumption: for some $C, \eta > 0$,

(WR)  $\sup_{\|x\|_2 \le 1} \mathbb{E}|\langle X, x\rangle|^{2+\eta} \le C.$

Let $\varepsilon > 0$ and

$\phi \le c_{2.1}\, \varepsilon^{1+2/\eta}$

where $c_{2.1}^{-1} := 10\,(5C)^{2/\eta}$. Then for every symmetric $n \times n$ matrix $A$ one has

$\mathbb{E}\,\ell_\phi(A + XX^T) \ge \ell_\phi(A) + 1 - \varepsilon.$

Iterating Theorem 2.1 easily yields a proof of Theorem 1.5 as follows.
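
Proof of Theorem 1.5. Let $A_0 = 0$ and $A_k = A_{k-1} + X_kX_k^T$ for $k \le N$. Setting $\phi =
c_{2.1}\, \varepsilon^{1+2/\eta}$, we find

$\ell_\phi(A_0) = -\frac{n}{\phi}.$

Applying Theorem 2.1 inductively to $A_1, \ldots, A_N$, we find that

$\mathbb{E}\big[\ell_\phi(A_k) - \ell_\phi(A_{k-1}) \,\big|\, A_{k-1}\big] \ge 1 - \varepsilon$  for all $k \le N$,

where we take the conditional expectation with respect to the random vector $X_k$ given the
random vectors $X_1, \ldots, X_{k-1}$, i.e. given $A_{k-1}$. Summing up these bounds yields

(2.1)  $\mathbb{E}\,\ell_\phi(A_N) \ge \ell_\phi(A_0) + N(1 - \varepsilon).$

Recalling that $\lambda_{\min}(A_N) \ge \ell_\phi(A_N)$ and dividing both sides of (2.1) by $N$, we conclude that

$\mathbb{E}\lambda_{\min}(\Sigma_N) \ge 1 - \varepsilon - \frac{n}{\phi N}.$

For $N \ge n/(\varepsilon\phi)$, the bound becomes $1 - 2\varepsilon$. Substituting the value of $\phi$ and replacing $\varepsilon$ by
$\varepsilon/2$ gives the promised result. □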
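
The bookkeeping of constants in this proof can be checked mechanically. The sketch below assumes the values $c_{2.1}^{-1} = 10(5C)^{2/\eta}$ and $C_{\mathrm{lower}} = 40(10C)^{2/\eta}$, and verifies that replacing $\varepsilon$ by $\varepsilon/2$ in the requirement $N \ge n/(\varepsilon\phi)$ turns the first constant into the second; the function names are hypothetical.

```python
import math

def c21_inverse(C, eta):
    """Inverse of the constant in Theorem 2.1, assumed to be 10 * (5C)^(2/eta)."""
    return 10.0 * (5.0 * C) ** (2.0 / eta)

def c_lower(C, eta):
    """Constant in (1.8), assumed to be 40 * (10C)^(2/eta)."""
    return 40.0 * (10.0 * C) ** (2.0 / eta)

def sample_size_lower(n, eps, C, eta):
    """Smallest integer N with N >= C_lower * eps^(-2 - 2/eta) * n."""
    return math.ceil(c_lower(C, eta) * eps ** (-2.0 - 2.0 / eta) * n)

if __name__ == "__main__":
    C, eta = 2.0, 1.0
    # N >= n / (eps * phi) with phi = c21 * eps^(1 + 2/eta) reads
    # N >= c21_inverse * eps^(-2 - 2/eta) * n; halving eps multiplies by 2^(2 + 2/eta).
    print(c21_inverse(C, eta) * 2.0 ** (2.0 + 2.0 / eta), c_lower(C, eta))
```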

The rest of this section is devoted to proving Theorem 2.1. Given a matrix $A$, a real
number $\ell < \lambda_{\min}(A)$, and a vector $x \in \mathbb{R}^n$, we say that $\delta \ge 0$ is a feasible lower shift if

$A \succ (\ell + \delta)I$  and  $\underline{m}_{A+xx^T}(\ell + \delta) \le \underline{m}_A(\ell).$

The definition of the soft spectral edge $\ell = \ell_\phi(A)$ along with monotonicity of the Stieltjes
transform implies that

$\ell_\phi(A + xx^T) \ge \ell_\phi(A) + \delta$

for every feasible lower shift $\delta$. So we will be done if we can produce a feasible shift $\delta$ such
that $\mathbb{E}\delta \ge 1 - \varepsilon$ where the expectation is over random $X$.

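We begin by reducing the feasibility for a shift $\delta$ to an inequality involving two quadratic
forms. The following lemma appeared in [4], and we include it with proof for completeness.

Lemma 2.2 (Feasible Lower Shift). Consider numbers $\ell \in \mathbb{R}$, $\delta > 0$, a matrix $A \succ (\ell + \delta)I$
and a vector $x$. Then a sufficient condition for

(2.2)  $\underline{m}_{A+xx^T}(\ell + \delta) \le \underline{m}_A(\ell)$

is

(2.3)  $\frac{1}{\delta} \cdot \frac{x^T(A - \ell - \delta)^{-2}x}{\operatorname{tr}(A - \ell - \delta)^{-2}} - x^T(A - \ell - \delta)^{-1}x =: \frac{1}{\delta}\, q_2(\delta, x) - q_1(\delta, x) \ge 1.$

Proof. We begin by expanding $\underline{m}_{A+xx^T}(\ell + \delta)$ using the Sherman-Morrison formula:

$\underline{m}_{A+xx^T}(\ell + \delta) = \operatorname{tr}(A + xx^T - \ell - \delta)^{-1} = \operatorname{tr}(A - \ell - \delta)^{-1} - \frac{x^T(A - \ell - \delta)^{-2}x}{1 + x^T(A - \ell - \delta)^{-1}x}.$

Furthermore,

$\operatorname{tr}(A - \ell - \delta)^{-1} = \underline{m}_A(\ell) + \operatorname{tr}\big[(A - \ell - \delta)^{-1} - (A - \ell)^{-1}\big].$

The assumption $A \succ (\ell + \delta)I$ implies that

$(A - \ell - \delta)^{-1} - (A - \ell)^{-1} \preceq \delta (A - \ell - \delta)^{-2}.$

Combining these estimates we see that (2.2) holds as long as

$\delta \operatorname{tr}(A - \ell - \delta)^{-2} - \frac{x^T(A - \ell - \delta)^{-2}x}{1 + x^T(A - \ell - \delta)^{-1}x} \le 0,$

which we can rearrange into (2.3) observing that all quadratic forms involved are positive. □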

The inequality (2.3) is quite nontrivial in the sense that $\delta$ appears in many places, and it
is not immediately clear from looking at it what the largest feasible $\delta$ is given $A$, $x$, and $\ell$.
In the following lemma, we present a tractable and explicit quantity defined solely in terms
of $q_1(0, x)$ and $q_2(0, x)$ which always satisfies (2.3) and thus provides a lower bound on the
best possible $\delta$.

Lemma 2.3 (Explicit Feasible Shift). Consider numbers $\ell \in \mathbb{R}$, $\phi > 0$, a matrix $A \succ \ell I$
satisfying $\underline{m}_A(\ell) \le \phi$, and a vector $x$. Then for every $t \in (0, 1)$, the shift

$\delta := (1 - t)^3\, q_2(0, x)\, 1_{\{q_1(0,x) \le t\}}\, 1_{\{q_2(0,x) \le t/\phi\}}$

satisfies $A \succ (\ell + \delta)I$ and condition (2.3). Therefore $\delta$ is a feasible lower shift, i.e.
$\underline{m}_{A+xx^T}(\ell + \delta) \le \underline{m}_A(\ell)$.
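
The explicit shift can be tested numerically on a small example. The sketch below takes a diagonal $2 \times 2$ matrix $A$, uses the quadratic forms $q_1(\delta, x) = x^T(A - \ell - \delta)^{-1}x$ and $q_2(\delta, x) = x^T(A - \ell - \delta)^{-2}x \,/\, \operatorname{tr}(A - \ell - \delta)^{-2}$, and checks that the shift does not increase the lower Stieltjes transform after the rank-one update. The restriction to diagonal $2 \times 2$ matrices and all function names are assumptions of this illustration.

```python
def quadratic_forms(diag, ell, delta, x):
    """Return (q1, q2) at shift delta for A = diag(diag), assuming A > (ell + delta) I."""
    gaps = [a - ell - delta for a in diag]
    q1 = sum(xi * xi / g for xi, g in zip(x, gaps))
    numerator = sum(xi * xi / (g * g) for xi, g in zip(x, gaps))
    denominator = sum(1.0 / (g * g) for g in gaps)
    return q1, numerator / denominator

def explicit_shift(diag, ell, phi, t, x):
    """The shift of Lemma 2.3: (1 - t)^3 * q2(0, x) if both indicators hold, else 0."""
    q1, q2 = quadratic_forms(diag, ell, 0.0, x)
    if q1 <= t and q2 <= t / phi:
        return (1.0 - t) ** 3 * q2
    return 0.0

def lower_stieltjes_after_update(diag, x, ell):
    """tr (A + x x^T - ell I)^{-1} for a 2x2 diagonal A, via tr(B^{-1}) = tr(B) / det(B)."""
    b00 = diag[0] + x[0] * x[0] - ell
    b11 = diag[1] + x[1] * x[1] - ell
    b01 = x[0] * x[1]
    return (b00 + b11) / (b00 * b11 - b01 * b01)

if __name__ == "__main__":
    diag, ell, t, x = [3.0, 5.0], 0.0, 0.5, [0.3, 0.4]
    phi = sum(1.0 / (a - ell) for a in diag)   # lower Stieltjes transform of A at ell
    delta = explicit_shift(diag, ell, phi, t, x)
    print(delta, lower_stieltjes_after_update(diag, x, ell + delta), phi)
```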

The proof is based on regularity properties of the quadratic forms $q_1$ and $q_2$, which we
state in the following two lemmas.

Lemma 2.4 (Regularity of Quadratic Forms). Consider numbers $\ell \in \mathbb{R}$, $\phi > 0$, a matrix
$A \succ \ell I$ satisfying $\underline{m}_A(\ell) \le \phi$, and a vector $x$. Then for every positive number $\delta < 1/\phi$, one
has $A \succ (\ell + \delta)I$, and moreover:

(i) $q_1(0, x) \le q_1(\delta, x) \le (1 - \delta\phi)^{-1} q_1(0, x)$;

(ii) $(1 - \delta\phi)^2 q_2(0, x) \le q_2(\delta, x) \le (1 - \delta\phi)^{-2} q_2(0, x)$.

1 To ease the notation, we sometimes write $A - u$ instead of $A - uI$.

Proof. The assumption $A \succ \ell I$ states that all eigenvalues $\lambda_i$ of $A$ satisfy $\lambda_i > \ell$. Together
with the assumption $\underline{m}_A(\ell) = \sum_{i=1}^n (\lambda_i - \ell)^{-1} \le \phi$ this implies that $(\lambda_i - \ell)^{-1} \le \phi$ for all $i$,
hence $\lambda_i - \ell \ge 1/\phi > \delta$ and $A \succ (\ell + \delta)I$ as claimed.

(i) Let $(\psi_i)_{i \le n}$ denote the eigenvectors of $A$; then

(2.4)  $q_1(\delta, x) = \sum_{i=1}^n \frac{\langle x, \psi_i\rangle^2}{\lambda_i - \ell - \delta}.$

Recalling that $\lambda_i - \ell \ge 1/\phi$, we have the comparison inequalities

$(1 - \delta\phi)(\lambda_i - \ell) = \lambda_i - \ell - \phi\delta(\lambda_i - \ell) \le \lambda_i - \ell - \delta \le \lambda_i - \ell.$

Using these for every term in (2.4) we complete the proof of (i).

(ii) Similar to (i), noting that the numerator and denominator of $q_2$ are increasing in $\delta$. □

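Lemma 2.5 (Moments of Quadratic Forms). Consider numbers $\ell \in \mathbb{R}$, $\phi > 0$ and a matrix
$A \succ \ell I$ satisfying $\underline{m}_A(\ell) \le \phi$. If $X$ is an isotropic random vector satisfying (WR), then for
$p = 1 + \eta/2$ the following moment bounds hold:

(i) $\mathbb{E}q_1(0, X) = \underline{m}_A(\ell) \le \phi$ and $\mathbb{E}q_1(0, X)^p \le C\phi^p$;

(ii) $\mathbb{E}q_2(0, X) = 1$ and $\mathbb{E}q_2(0, X)^p \le C$.

Proof. (i) As in the proof of the previous lemma, let $(\psi_i)_{i \le n}$ denote the eigenvectors of $A$.
By isotropy we have

$\mathbb{E}q_1(0, X) = \sum_{i=1}^n \frac{\mathbb{E}\langle X, \psi_i\rangle^2}{\lambda_i - \ell} = \underline{m}_A(\ell) \le \phi.$

For the moment bound we use Minkowski's inequality to obtain

$\big(\mathbb{E}q_1(0, X)^p\big)^{1/p} \le \sum_{i=1}^n \frac{(\mathbb{E}|\langle X, \psi_i\rangle|^{2p})^{1/p}}{\lambda_i - \ell} \le C^{1/p} \sum_{i=1}^n \frac{1}{\lambda_i - \ell} \le C^{1/p}\phi.$

(ii) Analogous to (i). □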
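
For completeness, part (ii) can be written out along the same lines. This is a sketch; the weights $w_i$ are notation introduced only here, with $q_2(0, X) = X^T(A - \ell)^{-2}X \,/\, \operatorname{tr}(A - \ell)^{-2}$ and $2p = 2 + \eta$:

```latex
q_2(0,X) = \sum_{i=1}^n w_i \,\langle X, \psi_i\rangle^2,
\qquad
w_i := \frac{(\lambda_i-\ell)^{-2}}{\sum_{j=1}^n (\lambda_j-\ell)^{-2}},
\qquad
\sum_{i=1}^n w_i = 1.

\mathbb{E}\, q_2(0,X) = \sum_{i=1}^n w_i\, \mathbb{E}\langle X,\psi_i\rangle^2 = \sum_{i=1}^n w_i = 1,
\qquad
\big(\mathbb{E}\, q_2(0,X)^p\big)^{1/p}
  \le \sum_{i=1}^n w_i \big(\mathbb{E}|\langle X,\psi_i\rangle|^{2p}\big)^{1/p}
  \le C^{1/p}.
```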

We can now finish the proof of Lemma 2.3.

Proof of Lemma 2.3. First observe that by construction

(2.5)  $\delta \le q_2(0, x)\, 1_{\{q_2(0,x) \le t/\phi\}} \le t/\phi < 1/\phi,$

so that we always have $A \succ (\ell + \delta)I$ by Lemma 2.4.

If either of the indicators in the definition of the shift $\delta$ is zero, then $\delta = 0$, which is
trivially feasible and we are done. So assume both indicators are nonzero, that is $q_1(0, x) \le t$
and $q_2(0, x) \le t/\phi$. By Lemma 2.2 it suffices to prove inequality (2.3), which is equivalent
to

$\frac{q_2(\delta, x)}{1 + q_1(\delta, x)} \ge \delta.$

We can show this by replacing $\delta$ with zero using Lemma 2.4:

$\frac{q_2(\delta, x)}{1 + q_1(\delta, x)} \ge \frac{q_2(0, x)(1 - \delta\phi)^2}{1 + q_1(0, x)(1 - \delta\phi)^{-1}}$

$\ge \frac{q_2(0, x)(1 - t)^2}{1 + t(1 - t)^{-1}}$  (as $\delta\phi \le t$ by (2.5) and $q_1(0, x) \le t$)

$= q_2(0, x)(1 - t)^3 = \delta.$

□



We now complete the proof of Theorem 2.1 by using the regularity properties of $X$ to show
that the expectation of $\delta$ as defined in Lemma 2.3 is large. Roughly speaking, this happens
because (1) $\delta$ is defined to be slightly less than $q_2(0, X)$ whenever both $q_1(0, X)$ and $q_2(0, X)$
are not too large; (2) that event occurs with very high probability when $\phi$ is sufficiently
small; (3) the expectation of $q_2(0, X)$ equals 1.

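Proof of Theorem 2.1. Let $\ell = \ell_\phi(A)$; then $\underline{m}_A(\ell) = \phi \le c_{2.1}\,\varepsilon^{1+2/\eta}$ by assumption. Define a
feasible shift $\delta$ as in Lemma 2.3 for $t = \varepsilon/5$. Recall that it suffices to prove that $\mathbb{E}\delta \ge 1 - \varepsilon$.
According to Lemma 2.3,

$\mathbb{E}\delta = (1 - t)^3 \big[\mathbb{E}q_2(0, X) - \mathbb{E}q_2(0, X)\, 1_{\{q_1(0,X) > t \,\vee\, q_2(0,X) > t/\phi\}}\big]$

$\ge (1 - t)^3 \big[1 - (\mathbb{E}q_2(0, X)^p)^{1/p} \cdot (\mathbb{P}\{q_1(0, X) > t \,\vee\, q_2(0, X) > t/\phi\})^{1/q}\big]$

where we used Hölder's inequality with exponents $p = 1 + \eta/2$ and $q = \frac{p}{p-1} = 2/\eta + 1$. By
Lemma 2.5, we have $\mathbb{E}q_2(0, X)^p \le C$. Next, the probability can be estimated by the union
bound, Markov's inequality and the moment bounds of Lemma 2.5, which gives

$\mathbb{P}\{q_1(0, X) > t \,\vee\, q_2(0, X) > t/\phi\} \le \mathbb{P}\{q_1(0, X)^p > t^p\} + \mathbb{P}\{q_2(0, X)^p > (t/\phi)^p\}$

$\le \frac{C\phi^p}{t^p} + \frac{C}{(t/\phi)^p} = 2C(\phi/t)^p.$

We conclude that

$\mathbb{E}\delta \ge (1 - t)^3 \big[1 - C^{1/p} \cdot (2C(\phi/t)^p)^{1/q}\big]$

$\ge (1 - t)^3 \big[1 - 2C(\phi/t)^{\eta/2}\big]$  (as $1/p + 1/q = 1$ and $p/q = \eta/2$)

$\ge (1 - \varepsilon/5)^3 (1 - \varepsilon/5)$  (substituting $t$ and the bound for $\phi$)

$\ge 1 - \varepsilon$

as promised. □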
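
The final numerical step of this chain is Bernoulli's inequality $(1 - x)^4 \ge 1 - 4x$, valid for $x \le 1$; spelled out for $x = \varepsilon/5$ with $\varepsilon \in [0, 5]$:

```latex
(1-\varepsilon/5)^3\,(1-\varepsilon/5)
  = (1-\varepsilon/5)^4
  \ge 1 - \tfrac{4}{5}\,\varepsilon
  \ge 1 - \varepsilon .
```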



3. The Upper Edge 



In this section we establish the following estimate for the expected largest eigenvalue,
analogous to Theorem 1.5 for the smallest one.

Theorem 3.1 (Largest eigenvalue). Consider independent isotropic random vectors $X_i$
valued in $\mathbb{R}^n$. Assume that $X_i$ satisfy (SR) for some $C, \eta > 0$. Then, for $\varepsilon \in (0, 1)$ and for

$N \ge C_{\mathrm{upper}}\, \varepsilon^{-2-2/\eta} \cdot n$

the maximum eigenvalue of the sample covariance matrix $\Sigma_N = \frac{1}{N}\sum_{i=1}^N X_iX_i^T$ satisfies

(3.1)  $\mathbb{E}\lambda_{\max}(\Sigma_N) \le 1 + \varepsilon.$

Here $C_{\mathrm{upper}} := 512\,(16C)^{1+2/\eta}(6 + 6/\eta)^{1+4/\eta}$.

We shall control the largest eigenvalue of a symmetric matrix $A$ using the (upper) Stieltjes
transform

$\overline{m}_A(u) = \operatorname{tr}(uI - A)^{-1} = \sum_{i=1}^n (u - \lambda_i(A))^{-1}, \quad u \in \mathbb{R}.$

Similarly to our argument for the lower edge, for a sensitivity value $\phi > 0$ we define the
upper soft spectral edge $u_\phi(A)$ to be the largest $u$ for which

$\overline{m}_A(u) = \phi.$

Since $\overline{m}_A(u)$ decreases from $\infty$ to $0$ as $u$ increases from the upper spectral edge $\lambda_{\max}(A)$ to
$\infty$, the value $u_\phi(A)$ is defined uniquely, and

$u_\phi(A) \ge \lambda_{\max}(A).$

For $\phi \to \infty$ we have $u_\phi(A) \to \lambda_{\max}(A)$, but as before we shall work with small sensitivity
values $\phi \in (0, 1)$. Our goal is to show that $u_\phi(A)$ increases by about 1 on average with every
rank-one update.

Theorem 3.2 (Random Upper Shift). Suppose $X$ is an isotropic random vector satisfying
the strong regularity assumption (SR) for some $C, \eta > 0$. Assume $\varepsilon \in (0, 1)$ and

(3.2)  $\phi \le c_{3.2}\, \varepsilon^{1+2/\eta}$

where $c_{3.2}^{-1} := 256\,(8C)^{1+2/\eta}(6 + 6/\eta)^{1+4/\eta}$. Then for every symmetric matrix $A$ one has

(3.3)  $\mathbb{E}\,u_\phi(A + XX^T) \le u_\phi(A) + 1 + \varepsilon.$

Iterating Theorem 13.21 yields a proof of Theorem 13. 11 

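Proof of Theorem 3.1. The argument is similar to the proof of Theorem 1.5 given in Section 2. We set $\varphi = \varphi(\varepsilon) = c_{3.2}^{-1}\, \varepsilon^{1+2/\eta}$. Then we start with $A_0 = 0$, where $u_\varphi(A_0) = n/\varphi$, and inductively apply Theorem 3.2 for $A_k = A_{k-1} + X_k X_k^T$ to obtain
\[
\mathbb{E}\, \lambda_{\max}(\Sigma_N) = \frac{1}{N}\, \mathbb{E}\, \lambda_{\max}(A_N) \le \frac{1}{N}\, \mathbb{E}\, u_\varphi(A_N) \le \frac{1}{N} \big( u_\varphi(A_0) + N(1 + \varepsilon) \big) = 1 + \varepsilon + \frac{n}{\varphi N}.
\]
For $N \ge n/(\varepsilon\varphi)$, the bound becomes $1 + 2\varepsilon$. Substituting the value of $\varphi$ and replacing $\varepsilon$ by $\varepsilon/2$ gives the promised result. □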
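The final substitution is pure arithmetic and can be sanity-checked numerically. The sketch below assumes the constants exactly as they are stated in Theorems 3.1 and 3.2; the helper names `c_32`, `c_upper` and `expected_edge_bound` are hypothetical.

```python
import math


def c_32(C, eta):
    """Constant in the sensitivity condition (3.2) of Theorem 3.2."""
    return 256.0 * (8.0 * C) ** (1 + 2 / eta) * (6 + 6 / eta) ** (1 + 4 / eta)


def c_upper(C, eta):
    """Constant C_upper in the sample size condition of Theorem 3.1."""
    return 512.0 * (16.0 * C) ** (1 + 2 / eta) * (6 + 6 / eta) ** (1 + 4 / eta)


def expected_edge_bound(n, N, eps, C, eta):
    """The bound 1 + eps + n/(phi*N) with phi = eps^(1+2/eta) / c_32."""
    phi = eps ** (1 + 2 / eta) / c_32(C, eta)
    return 1 + eps + n / (phi * N)


# Running the argument with eps/2 and N = C_upper * n / eps^(2+2/eta)
# should reproduce the bound 1 + eps of Theorem 3.1.
C, eta, eps, n = 1.0, 2.0, 0.1, 10
N = c_upper(C, eta) * n / eps ** (2 + 2 / eta)
bound = expected_edge_bound(n, N, eps / 2, C, eta)
```

In particular $C_{\mathrm{upper}} = 2^{2+2/\eta}\, c_{3.2}$, which is exactly the factor produced by replacing $\varepsilon$ with $\varepsilon/2$ in $N \ge n/(\varepsilon\varphi)$.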

The above proof works for $\varepsilon, \varphi(\varepsilon) < 1$ and thus for $N = \Omega(n)$, but it may be extended to smaller $N$ as follows.

Proof of Corollary 1.4. In the proof of Theorem 3.1 we have shown that for every $\varepsilon \in (0,1)$ and every positive integer $N$, we have
\[
E := \mathbb{E}\, \lambda_{\max}(\Sigma_N) \le 1 + \varepsilon + \frac{n}{\varphi(\varepsilon)\, N}
\]
where $\varphi(\varepsilon) = c_{3.2}^{-1}\, \varepsilon^{1+2/\eta}$. Optimizing in $\varepsilon$, we apply this estimate with $\varepsilon = (n/N)^{\frac{1}{2+2/\eta}}$ when $n \le N$ and with $\varepsilon = 1/2$ when $n > N$ to obtain:

B <5 + ^^ 1 + 22+2/ "^) lf "^ 

Combining these, for every n and N we conclude that 
as required. 

A similar bound for $\mathbb{E}\, \lambda_{\min}(\Sigma_N)$ is immediate from Theorem 1.5 (see the remark after its proof). □

The rest of this section is devoted to proving Theorem 3.2. Given a matrix $A$, a real number $u > \lambda_{\max}(A)$, and a vector $x \in \mathbb{R}^n$, we say that $\Delta \ge 0$ is a feasible upper shift if
\[
A + xx^T \prec (u + \Delta) I \quad \text{and} \quad m_{A + xx^T}(u + \Delta) \le m_A(u). \tag{3.4}
\]
The definition of the soft spectral edge $u = u_\varphi(A)$ along with monotonicity of the Stieltjes transform implies that
\[
u_\varphi(A + xx^T) \le u_\varphi(A) + \Delta \tag{3.5}
\]
for every feasible upper shift $\Delta$. So we will be done if we can produce a feasible shift $\Delta$ such that $\mathbb{E}\, \Delta \le 1 + \varepsilon$, where the expectation is over random $X$.

As in our argument for the lower edge, we begin by reducing the feasibility for a shift $\Delta$ to an inequality involving two quadratic forms.

Lemma 3.3 (Feasible Upper Shift). Consider numbers $u \in \mathbb{R}$, $\Delta > 0$, a matrix $A \prec uI$ and a vector $x$. Then a sufficient condition for $\Delta$ to be a feasible upper shift is
\[
\frac{x^T (u + \Delta - A)^{-2} x}{m_A(u) - m_A(u + \Delta)} + x^T (u + \Delta - A)^{-1} x =: Q_2(\Delta, x) + Q_1(\Delta, x) \le 1. \tag{3.6}
\]

Proof. Note that $A \prec uI \prec (u + \Delta) I$ so that all quadratic forms are positive, and assume $x \ne 0$ since otherwise the claim is trivial. As in the proof of Lemma 2.2, we use the Sherman-Morrison formula to write:
\[
\begin{aligned}
m_{A + xx^T}(u + \Delta) &= \operatorname{tr}(u + \Delta - A - xx^T)^{-1} \\
&= m_A(u + \Delta) + \frac{x^T (u + \Delta - A)^{-2} x}{1 - x^T (u + \Delta - A)^{-1} x} \\
&= m_A(u) - \big( m_A(u) - m_A(u + \Delta) \big) + \frac{x^T (u + \Delta - A)^{-2} x}{1 - x^T (u + \Delta - A)^{-1} x}.
\end{aligned}
\]
Rearranging reveals that $m_{A + xx^T}(u + \Delta) \le m_A(u)$ exactly when (3.6) holds.
To establish the second condition
\[
xx^T \prec u + \Delta - A, \tag{3.7}
\]
we recall that
\[
R \prec S \iff S^{-1/2} R S^{-1/2} \prec I
\]
for all positive matrices $R, S$ (this can be seen, for instance, using the Courant-Fischer theorem). Applying this fact to (3.7), we see that it suffices to have
\[
(u + \Delta - A)^{-1/2}\, xx^T\, (u + \Delta - A)^{-1/2} \prec I
\]
or equivalently
\[
x^T (u + \Delta - A)^{-1} x < 1,
\]
which follows from (3.6) and $Q_2(\Delta, x) > 0$. □
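The Sherman-Morrison step and the resulting feasibility criterion can be checked on a small example. The sketch below is only an illustration under simplifying assumptions: $A$ is diagonal (that is, we work in its eigenbasis) and $n = 2$, so that the trace of the resolvent of $A + xx^T$ has a closed form. All function names and numerical values are ours.

```python
def m(eigs, z):
    """Stieltjes transform m_A(z) of A = diag(eigs)."""
    return sum(1.0 / (z - lam) for lam in eigs)


def q1(eigs, x, u, delta):
    """Q_1(delta, x) = x^T (u + delta - A)^{-1} x."""
    return sum(xi * xi / (u + delta - lam) for xi, lam in zip(x, eigs))


def q2(eigs, x, u, delta):
    """Q_2(delta, x) = x^T (u + delta - A)^{-2} x / (m_A(u) - m_A(u + delta))."""
    num = sum(xi * xi / (u + delta - lam) ** 2 for xi, lam in zip(x, eigs))
    return num / (m(eigs, u) - m(eigs, u + delta))


def m_updated_2x2(eigs, x, z):
    """tr (zI - A - x x^T)^{-1}, computed directly for a 2 x 2 matrix."""
    a = eigs[0] + x[0] * x[0]
    d = eigs[1] + x[1] * x[1]
    b = x[0] * x[1]
    return ((z - a) + (z - d)) / ((z - a) * (z - d) - b * b)


eigs, x, u, delta = [0.0, 1.0], [0.6, 0.8], 3.0, 1.5
lemma_condition = q1(eigs, x, u, delta) + q2(eigs, x, u, delta) <= 1.0
stieltjes_condition = m_updated_2x2(eigs, x, u + delta) <= m(eigs, u)
```

For these values both `lemma_condition` and `stieltjes_condition` hold, in line with the sufficiency asserted by Lemma 3.3.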

We will reason about the two quantities $Q_1$ and $Q_2$ separately, producing two separate shifts $\Delta_1$ and $\Delta_2$ for them and eventually combining these into a single $\Delta := \Delta_1 \vee \Delta_2$ as required by Lemma 3.3.

For some fixed parameter $\tau \in (0,1)$, let us define $\Delta_1 = \Delta_1(A, x, u)$ and $\Delta_2 = \Delta_2(A, x, u)$ to be the smallest non-negative numbers which satisfy
\[
Q_1(\Delta_1, x) \le \tau, \qquad Q_2(\Delta_2, x) \le 1 - \tau. \tag{3.8}
\]
For $u = u_\varphi(A)$ and for a random vector $x = X$, Lemmas 3.4 and 3.6 will allow us to control the expected value of each of these shifts:
\[
\mathbb{E}\, \Delta_1 \le \varepsilon/2, \qquad \mathbb{E}\, \Delta_2 \le 1 + \varepsilon/2, \tag{3.9}
\]
whenever the sensitivity parameter $\varphi = \varphi(\tau, \varepsilon)$ is sufficiently small. From this we will obtain Theorem 3.2 quickly as follows.
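Since both quadratic forms are decreasing in the shift, the two shifts can be located by bisection. The following sketch again assumes a diagonal $A$ (given by its eigenvalues) and uses illustrative names and tolerances; it is meant only to show how the two shifts and their maximum interact, not as part of the proof.

```python
def q1(eigs, x, u, delta):
    """Q_1(delta, x) = x^T (u + delta - A)^{-1} x for A = diag(eigs)."""
    return sum(xi * xi / (u + delta - lam) for xi, lam in zip(x, eigs))


def q2(eigs, x, u, delta):
    """Q_2(delta, x); equals +infinity at delta = 0 when x is nonzero."""
    num = sum(xi * xi / (u + delta - lam) ** 2 for xi, lam in zip(x, eigs))
    if num == 0.0:
        return 0.0
    if delta == 0.0:
        return float("inf")  # the denominator m_A(u) - m_A(u + delta) vanishes
    den = sum(1.0 / (u - lam) - 1.0 / (u + delta - lam) for lam in eigs)
    return num / den


def smallest_shift(f, bound, tol=1e-12):
    """Smallest delta >= 0 with f(delta) <= bound, for a decreasing f."""
    if f(0.0) <= bound:
        return 0.0
    lo, hi = 0.0, 1.0
    while f(hi) > bound:          # grow the bracket until hi is admissible
        lo, hi = hi, 2.0 * hi
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if f(mid) > bound:
            lo = mid
        else:
            hi = mid
    return hi


def combined_shift(eigs, x, u, tau):
    """Return (Delta_1, Delta_2, Delta_1 v Delta_2) as in (3.8)."""
    d1 = smallest_shift(lambda d: q1(eigs, x, u, d), tau)
    d2 = smallest_shift(lambda d: q2(eigs, x, u, d), 1.0 - tau)
    return d1, d2, max(d1, d2)


d1, d2, d = combined_shift([0.0, 1.0], [0.6, 0.8], 3.0, 0.25)
```

By monotonicity, the maximum `d` satisfies $Q_1 + Q_2 \le 1$, which is the hypothesis of Lemma 3.3.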

Proof of Theorem 3.2. Let $u_\varphi(A) = u$, so the condition $A \prec uI$ of Lemma 3.3 holds. Consider the shifts $\Delta_1 = \Delta_1(A, X, u)$ and $\Delta_2 = \Delta_2(A, X, u)$ defined above. By (3.8), we have
\[
Q_1(\Delta_1, X) + Q_2(\Delta_2, X) \le 1.
\]
Moreover, a quick inspection of the quadratic forms in Lemma 3.3 shows that $Q_1(\Delta, X)$ and $Q_2(\Delta, X)$ are decreasing in $\Delta$, hence
\[
Q_1(\Delta_1 \vee \Delta_2, X) + Q_2(\Delta_1 \vee \Delta_2, X) \le 1.
\]
Then Lemma 3.3 guarantees that $\Delta_1 \vee \Delta_2$ is a feasible upper shift, which implies by (3.5) that
\[
u_\varphi(A + XX^T) \le u_\varphi(A) + \Delta_1 \vee \Delta_2.
\]
Furthermore, (3.9) yields a bound on the expected shift
\[
\mathbb{E}\, \Delta_1 \vee \Delta_2 \le \mathbb{E}\, \Delta_1 + \mathbb{E}\, \Delta_2 \le 1 + \varepsilon,
\]
which gives the conclusion (3.3) of Theorem 3.2.

It remains to note that Lemmas 3.4 and 3.6 only guarantee the bounds (3.9) when the sensitivity $\varphi$ is sufficiently small, namely $\varphi \le \varphi_1(\tau, \varepsilon/2) \wedge \varphi_2(\tau, \varepsilon/2)$. With $\tau = \varepsilon/16$, we can simplify this inequality into the assumption of Theorem 3.2. □

The rest of this section is devoted to controlling the shifts $\Delta_1$ and $\Delta_2$.

Remark. It is easy to check that the proofs of Lemmas 3.4 and 3.6 which follow, and consequently Theorem 3.2, only require
\[
\mathbb{E}\, X_i X_i^T \preceq cI, \tag{3.10}
\]
for some constant $c = c(\varepsilon)$. Thus if we desire a bound of $\lambda_{\max}\big( \frac{1}{N} \sum_{i=1}^N X_i X_i^T \big) \le 1 + \varepsilon$ in Theorem 3.1, then $\mathbb{E}\, X_i X_i^T = I$ can be replaced by the weaker condition (3.10).

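3.1. Control of $\Delta_1$.

Lemma 3.4. Consider numbers $u \in \mathbb{R}$, $\varphi > 0$ and a matrix $A \prec uI$ satisfying $m_A(u) \le \varphi$. Let $X$ be a random vector satisfying (SR) for some $C, \eta > 0$, and let $\varepsilon, \tau \in (0,1)$. If the sensitivity satisfies
\[
\varphi \le \varphi_1(\tau, \varepsilon) := \frac{\tau^{1+1/\eta}\, \varepsilon^{1/\eta}}{(4C)^{1+1/\eta} (4 + 4/\eta)^{1+3/\eta}}
\]
then the shift $\Delta_1 = \Delta_1(A, X, u)$ satisfies
\[
\mathbb{E}\, \Delta_1 \le \varepsilon.
\]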

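Proof. Let $(\psi_i)_{i \le n}$ and $(\lambda_i)_{i \le n}$ denote the eigenvectors and eigenvalues of $A$, and let $\xi_i = \langle X, \psi_i \rangle^2$. We know that $m_A(u) = \sum_{i=1}^n (u - \lambda_i)^{-1} \le \varphi$, and $\Delta_1$ is the smallest non-negative number satisfying
\[
\sum_{i=1}^n \frac{\xi_i}{u - \lambda_i + \Delta_1} \le \tau.
\]
Rescaling everything by $\varphi$ and setting $\mu_i := \varphi(u - \lambda_i)$ so that
\[
\sum_{i=1}^n \frac{1}{\mu_i} = \frac{1}{\varphi} \sum_{i=1}^n (u - \lambda_i)^{-1} \le 1,
\]
the problem becomes equivalent to bounding the least $\mu := \varphi \Delta_1$ for which
\[
\sum_{i=1}^n \frac{\xi_i}{\mu_i + \mu} \le \frac{\tau}{\varphi}.
\]
Applying the following somewhat more general probabilistic lemma to $(\xi_i)_{i \le n}$, we conclude that
\[
\mathbb{E}\, \Delta_1 \le \frac{1}{\varphi}\, \mathbb{E}\, \mu \le \frac{1}{\varphi} \cdot \frac{C (4 + 4/\eta)^{3+\eta} (4\varphi)^{1+\eta}}{\tau^{1+\eta}}
\]
whenever $\tau/\varphi \ge 4C$. Substituting $\varphi = \varphi_1(\tau, \varepsilon)$ gives the promised bound. □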
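Lemma 3.5. Suppose $\{\xi_i\}_{i \le n}$ are positive random variables with $\mathbb{E}\, \xi_i = 1$ and:
\[
\mathbb{P}\Big\{ \sum_{i \in S} \xi_i > t \Big\} \le \frac{C}{t^{1+\eta}} \quad \text{provided } t \ge C|S| = C\, \mathbb{E} \sum_{i \in S} \xi_i \tag{3.11}
\]
for all subsets $S \subseteq [n]$ and some constants $C, \eta > 0$. Consider positive numbers $\mu_i$ such that
\[
\sum_{i=1}^n \frac{1}{\mu_i} \le 1.
\]
Let $\mu$ be the minimal positive number such that
\[
\sum_{i=1}^n \frac{\xi_i}{\mu_i + \mu} \le K
\]
for some $K \ge 4C$. Then
\[
\mathbb{E}\, \mu \le \frac{C (4 + 4/\eta)^{3+\eta}}{(K/4)^{1+\eta}}.
\]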

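Proof. For simplicity of calculations, assume for the moment that the values of all $\mu_i$ are dyadic, i.e.
\[
\mu_i \in \{2^0, 2^1, 2^2, \ldots\}.
\]
For each dyadic number $k$, let
\[
I_k := \{i : \mu_i = k\}, \qquad n_k := |I_k|.
\]
By assumptions, we have
\[
1 \ge \sum_{i=1}^n \frac{1}{\mu_i} = \sum_{k \text{ dyadic}} \sum_{i \in I_k} \frac{1}{k} = \sum_{k \text{ dyadic}} \frac{n_k}{k}
\]
and $\mu$ is the smallest positive number such that
\[
\sum_{i=1}^n \frac{\xi_i}{\mu_i + \mu} = \sum_{k \text{ dyadic}} \frac{1}{k + \mu} \sum_{i \in I_k} \xi_i \le K. \tag{3.12}
\]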

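We estimate $\mu$ by replacing it with a bigger but easier quantity $\mu'$. Define $\mu'$ to be the smallest positive number such that, for every dyadic $k$, one has
\[
\frac{1}{k + \mu'} \sum_{i \in I_k} \xi_i \le \varepsilon_k \quad \text{where} \quad \varepsilon_k := \frac{K}{2} \cdot \frac{n_k}{k} + \frac{K}{2\sigma}\, k^{-\frac{\eta}{2+2\eta}},
\]
where
\[
\sigma := \sum_{k \text{ dyadic}} k^{-\frac{\eta}{2+2\eta}} \le 4 + 4/\eta. \tag{3.13}
\]
Since
\[
\sum_{k \text{ dyadic}} \frac{1}{k + \mu'} \sum_{i \in I_k} \xi_i \le \sum_{k \text{ dyadic}} \varepsilon_k = \frac{K}{2} \sum_{k \text{ dyadic}} \frac{n_k}{k} + \frac{K}{2\sigma} \sum_{k \text{ dyadic}} k^{-\frac{\eta}{2+2\eta}} \le K,
\]
the definition of $\mu$ given in (3.12) yields
\[
\mu \le \mu'.
\]
It remains to bound $\mathbb{E}\, \mu'$.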



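By definition,
\[
\mu' = \max_{k \text{ dyadic}} \Big( \frac{1}{\varepsilon_k} \sum_{i \in I_k} \xi_i - k \Big).
\]
Let $\theta_k = \frac{1}{\varepsilon_k} \sum_{i \in I_k} \xi_i - k$. For every $t > 0$, one has
\[
\mathbb{P}\{\theta_k > t\} = \mathbb{P}\Big\{ \sum_{i \in I_k} \xi_i > (k + t)\, \varepsilon_k \Big\}.
\]
Since $\varepsilon_k \ge \frac{K n_k}{2k}$ by definition, we have
\[
(k + t)\, \varepsilon_k \ge k\, \varepsilon_k \ge \frac{K n_k}{2} = \frac{K}{2}\, \mathbb{E} \sum_{i \in I_k} \xi_i \ge C\, \mathbb{E} \sum_{i \in I_k} \xi_i.
\]
So by the regularity assumption (3.11),
\[
\mathbb{P}\{\theta_k > t\} \le \frac{C}{(k + t)^{1+\eta}\, \varepsilon_k^{1+\eta}}.
\]
A union bound then gives
\[
\begin{aligned}
\mathbb{P}\{\mu' > t\} &\le \sum_{k \text{ dyadic}} \frac{C}{(k + t)^{1+\eta}\, \varepsilon_k^{1+\eta}} \\
&\le \frac{C}{(K/2\sigma)^{1+\eta}} \sum_{k \text{ dyadic}} \frac{k^{\eta/2}}{(k + t)^{1+\eta}} && \text{(by definition of } \varepsilon_k\text{)}.
\end{aligned}
\]
This implies that
\[
\begin{aligned}
\mathbb{E}\, \mu' = \int_0^\infty \mathbb{P}\{\mu' > t\}\, dt &\le \frac{C}{(K/2\sigma)^{1+\eta}} \cdot \frac{2}{\eta} \sum_{k \text{ dyadic}} \frac{1}{k^{\eta/2}} \\
&\le \frac{C}{(K/2\sigma)^{1+\eta}} \cdot \frac{2}{\eta} \cdot \frac{4}{\eta} && \text{(by a calculation similar to (3.13))}.
\end{aligned}
\]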

(K/2cr) L+T i T) 17 

The promised bound for general (non-dyadic) $\mu_i$ follows by rounding each $\mu_i$ down to the nearest power of 2 and replacing $K$ by $K/2$. □

Remark (Necessity of the strong regularity assumption (SR)). The preceding lemma is the only place in the proof where the full power of (SR) is used. To see that it is necessary, consider the following situation. Fix any $S \subseteq [n]$ and let $\frac{1}{\mu_i} = \frac{\mathbf{1}\{i \in S\}}{|S|}$ so that $\sum_i \frac{1}{\mu_i} = 1$. Then the smallest $\mu \ge 0$ for which $\sum_i \frac{\xi_i}{\mu_i + \mu} \le K$ is just
\[
\mu = \Big( \frac{1}{K} \sum_{i \in S} \xi_i - |S| \Big)_+.
\]
We now lower bound the tail probability
\[
\mathbb{P}\{\mu > t\} = \mathbb{P}\Big\{ \sum_{i \in S} \xi_i > K(|S| + t) \Big\} \ge \mathbb{P}\Big\{ \sum_{i \in S} \xi_i > 2Kt \Big\} \quad \text{for } t \ge |S|.
\]
In order to have $\mathbb{E}\, \mu = O(1)$, this probability must be $O(1/t)$ by Markov's inequality, which is essentially assumption (3.11) of the lemma. In the proof of Theorems 1.1 and 3.2 the sums of random variables $\xi_i$ arise from projections of the random vector $X$ onto varying eigenspaces of $A$; the only succinct way to guarantee (3.11) for all such projections is essentially (SR).

3.2. Control of $\Delta_2$.

Lemma 3.6. Consider numbers $u \in \mathbb{R}$, $\varphi > 0$ and a matrix $A \prec uI$ satisfying $m_A(u) \le \varphi$. Let $X$ be a random vector satisfying (SR) for some $C, \eta > 0$, and let $\varepsilon \in (0,1)$, $0 < \tau < \varepsilon/2$ be parameters. If the sensitivity satisfies

<t> < <h\J,e) ■- 128 . (2C) 2 /"(4 + 6/n) 4 /^ 
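then the shift $\Delta_2 = \Delta_2(A, X, u)$ satisfies
\[
\mathbb{E}\, \Delta_2 \le 1 + \varepsilon.
\]

It will be more convenient to work with the quadratic form
\[
Q_2'(\Delta, x) := \frac{x^T (u + \Delta - A)^{-2} x}{\operatorname{tr}(u + \Delta - A)^{-2}},
\]
for which we have
\[
\frac{1}{\Delta}\, Q_2'(\Delta, x) \ge Q_2(\Delta, x) \quad \text{for } \Delta > 0, \tag{3.14}
\]
since the denominators satisfy:
\[
m_A(u) - m_A(u + \Delta) = \operatorname{tr}\big[ (uI - A)^{-1} - (u + \Delta - A)^{-1} \big] \ge \Delta \operatorname{tr}(u + \Delta - A)^{-2}.
\]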
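The comparison (3.14) is easy to test numerically. The sketch below assumes a diagonal $A$ given by its eigenvalues and uses hypothetical helper names; it also illustrates that $Q_2'$ is a weighted average of the squared coordinates of $x$ in the eigenbasis.

```python
def q2(eigs, x, u, delta):
    """Q_2(delta, x) = x^T (u+delta-A)^{-2} x / (m_A(u) - m_A(u+delta))."""
    num = sum(xi * xi / (u + delta - lam) ** 2 for xi, lam in zip(x, eigs))
    den = sum(1.0 / (u - lam) - 1.0 / (u + delta - lam) for lam in eigs)
    return num / den


def q2_prime(eigs, x, u, delta):
    """Q_2'(delta, x) = x^T (u+delta-A)^{-2} x / tr (u+delta-A)^{-2}."""
    weights = [1.0 / (u + delta - lam) ** 2 for lam in eigs]
    num = sum(w * xi * xi for w, xi in zip(weights, x))
    return num / sum(weights)


eigs, x, u = [0.0, 1.0, 2.5], [0.6, 0.8, 1.2], 3.0
# Gap between the two sides of (3.14) for a few shifts; it should be >= 0.
gaps = [q2_prime(eigs, x, u, d) / d - q2(eigs, x, u, d) for d in (0.1, 1.0, 10.0)]
```

Each entry of `gaps` is non-negative, in agreement with (3.14).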

Remark. The reason for working with $Q_2$ rather than directly with $Q_2'$ in Lemma 3.3 is that $Q_2(\Delta, x)$ is decreasing in $\Delta$; this monotonicity is required when arguing that the maximum of the two shifts $\Delta = \Delta_1 \vee \Delta_2$ is feasible in the proof of Theorem 3.2.

We begin by recording some regularity properties of $Q_2'(\Delta, X)$.

Lemma 3.7 (Regularity and Moments of $Q_2'(\Delta, X)$). Consider numbers $u \in \mathbb{R}$, $\varphi > 0$ and a matrix $A \prec uI$ satisfying $m_A(u) \le \varphi$. Let $X$ be a random vector satisfying (SR) for some $C, \eta > 0$. Then for every $\Delta \ge 0$ one has:

(i) $Q_2'(\Delta, X) \le (1 + \varphi\Delta)^2\, Q_2'(0, X)$;

(ii) $\mathbb{E}\, Q_2'(\Delta, X) = 1$;

(iii) $\mathbb{E}\, Q_2'(\Delta, X)^p \le C(3 + 3/\eta)$ for $p = 1 + 2\eta/3$.

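Proof. (i) is analogous to Lemma 2.4. In a similar way, we show that all eigenvalues $\lambda_i$ of $A$ satisfy $u - \lambda_i \ge 1/\varphi$, which implies the comparison inequality
\[
u - \lambda_i \le u + \Delta - \lambda_i \le (1 + \varphi\Delta)(u - \lambda_i).
\]
Denoting by $(\psi_i)_{i \le n}$ the eigenvectors of $A$, we express
\[
Q_2'(\Delta, X) = \frac{\sum_{i=1}^n (u + \Delta - \lambda_i)^{-2} \langle X, \psi_i \rangle^2}{\sum_{i=1}^n (u + \Delta - \lambda_i)^{-2}}. \tag{3.15}
\]
The comparison inequality yields (i).

(ii) We note that (3.15) can be rearranged as a convex combination of $\langle X, \psi_i \rangle^2$:
\[
Q_2'(\Delta, X) = \sum_{i=1}^n \alpha_i \langle X, \psi_i \rangle^2 \quad \text{where } \alpha_i \ge 0, \ \sum_{i=1}^n \alpha_i = 1.
\]
Then (ii) follows since $\mathbb{E} \langle X, \psi_i \rangle^2 = 1$ by isotropy.

(iii) We apply Minkowski's inequality to obtain
\[
\big( \mathbb{E}\, Q_2'(\Delta, X)^p \big)^{1/p} \le \sum_{i=1}^n \alpha_i \big( \mathbb{E} \langle X, \psi_i \rangle^{2p} \big)^{1/p}.
\]
Now a simple integration of tails implies that each
\[
\mathbb{E} \langle X, \psi_i \rangle^{2p} = \mathbb{E} \langle X, \psi_i \rangle^{2 + 4\eta/3} \le C(3 + 3/\eta),
\]
which concludes the proof. □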

Next, we see how the regularity properties of $Q_2'(\Delta, X)$ translate into the corresponding properties of $\Delta_2$:

Lemma 3.8 (Regularity of $\Delta_2$). Consider numbers $u \in \mathbb{R}$, $\varphi > 0$ and a matrix $A \prec uI$ satisfying $m_A(u) \le \varphi$. Let $X$ be a random vector satisfying (SR) for some $C, \eta > 0$, and let $0 < \tau < 1/2$. Then the shift $\Delta_2 = \Delta_2(A, X, u)$ satisfies:

(i) $\mathbb{E}\, \Delta_2^{1+\eta/2} \le 2^{1+\eta}\, C (4 + 6/\eta)^2$;

(ii) $\mathbb{E}\, \Delta_2\, \mathbf{1}_{\{Q_2'(0,X) \le (t - 2\tau)/8\varphi\}} \le 1 + t$ for every $t \in [0, 1]$.

Proof. (i) By definition of $\Delta_2$ and using (3.14), we have for all $t > 0$:
\[
\mathbb{P}\{\Delta_2 > t\} \le \mathbb{P}\{Q_2(t, X) > 1 - \tau\} \le \mathbb{P}\{Q_2'(t, X) > t(1 - \tau)\}.
\]
This probability can be controlled using Lemma 3.7 (iii) and Markov's inequality, so we obtain
\[
\mathbb{P}\{\Delta_2 > t\} \le \frac{C(3 + 3/\eta)}{t^{1+2\eta/3} (1 - \tau)^{1+2\eta/3}} \le \frac{C(3 + 3/\eta)}{(1/2)^{1+2\eta/3}\, t^{1+2\eta/3}}
\]
as $\tau < 1/2$. Integration of tails yields
\[
\mathbb{E}\, \Delta_2^{1+\eta/2} \le 2^{1+2\eta/3} \cdot C(3 + 3/\eta)(4 + 6/\eta),
\]
which implies the claim.

(ii) Let $s_0$ denote the smaller solution of the quadratic equation
\[
(1 + s\varphi)^2\, Q_2'(0, X) = s(1 - \tau),
\]
whenever a solution exists. In this case $s_0 > 0$ and Lemma 3.7 (i) yields that
\[
Q_2'(s_0, X) \le s_0 (1 - \tau).
\]
By (3.14), this yields $Q_2(s_0, X) \le 1 - \tau$. By definition of $\Delta_2$, this in turn implies that
\[
\Delta_2 \le s_0.
\]
An elementary calculation shows that if $Q_2'(0, X) \le (t - 2\tau)/8\varphi$ then the solution $s_0$ exists and satisfies
\[
s_0 \le (1 + t)\, Q_2'(0, X).
\]
It follows that
\[
\mathbb{E}\, s_0\, \mathbf{1}_{\{Q_2'(0,X) \le (t - 2\tau)/8\varphi\}} \le (1 + t)\, \mathbb{E}\, Q_2'(0, X) = 1 + t,
\]
where we used Lemma 3.7 (ii) in the last step. □
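The elementary calculation behind part (ii) can be explored numerically. In the sketch below, `q` stands for the value of $Q_2'(0, X)$; the function name and the sample parameter values are our own assumptions, chosen so that $q \le (t - 2\tau)/8\varphi$.

```python
import math


def smaller_root(q, phi, tau):
    """Smaller solution s0 of (1 + s*phi)^2 * q = s*(1 - tau), or None.

    Expanding gives a*s^2 + b*s + q = 0 with a = phi^2 * q and
    b = 2*phi*q - (1 - tau).
    """
    a = phi * phi * q
    b = 2.0 * phi * q - (1.0 - tau)
    disc = b * b - 4.0 * a * q
    if q <= 0.0 or b >= 0.0 or disc < 0.0:
        return None
    # Numerically stable form of (-b - sqrt(disc)) / (2a).
    return 2.0 * q / (-b + math.sqrt(disc))


phi, tau, t = 0.1, 0.05, 0.5
threshold = (t - 2 * tau) / (8 * phi)   # here (0.5 - 0.1) / 0.8
roots = {q: smaller_root(q, phi, tau) for q in (0.01, 0.1, 0.3, 0.49)}
```

For each of these sample values the root exists and stays below $(1 + t)\,q$, as the lemma predicts.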

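We can now complete the proof of Lemma 3.6.

Proof of Lemma 3.6. We decompose
\[
\mathbb{E}\, \Delta_2 = \mathbb{E}\, \Delta_2\, \mathbf{1}_{\{Q_2'(0,X) \le (t - 2\tau)/8\varphi\}} + \mathbb{E}\, \Delta_2\, \mathbf{1}_{\{Q_2'(0,X) > (t - 2\tau)/8\varphi\}} =: E_1 + E_2.
\]
By Lemma 3.8 (ii), we have $E_1 \le 1 + t$. Next, we estimate $E_2$ using Hölder's inequality:
\[
E_2 \le \big( \mathbb{E}\, \Delta_2^{1+\eta/2} \big)^{\frac{1}{1+\eta/2}} \big( \mathbb{P}\{ Q_2'(0, X) > (t - 2\tau)/8\varphi \} \big)^{\frac{\eta/2}{1+\eta/2}}.
\]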

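The two terms here can be estimated using Lemma 3.8 (i) and Lemma 3.7 along with Markov's inequality:
\[
\begin{aligned}
E_2 &\le \big( 2^{1+\eta}\, C (4 + 6/\eta)^2 \big)^{\frac{1}{1+\eta/2}} \Big( \frac{C(3 + 3/\eta)}{((t - 2\tau)/8\varphi)^{1+\eta/2}} \Big)^{\frac{\eta/2}{1+\eta/2}} \\
&\le 2^{1+\eta}\, C (4 + 6/\eta)^2 \cdot \Big( \frac{8\varphi}{t - 2\tau} \Big)^{\eta/2}.
\end{aligned}
\]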



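Finally, we set $t = \varepsilon/2$ and use the assumptions $\varphi \le \varphi_2(\tau, \varepsilon)$ and $\tau < \varepsilon/2$ to conclude that $E_2 \le \varepsilon/2$. Together with $E_1 \le 1 + t = 1 + \varepsilon/2$ this implies
\[
\mathbb{E}\, \Delta_2 \le 1 + \varepsilon
\]
as claimed. □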

Remark. Although for convenience of application Lemma 3.6 is stated under the strong regularity assumption (SR), the latter is not used in the proof. The argument above uses only the weak regularity assumption (WR).



4. The Spectral Norm 

In this section we prove Theorem 1.1 by showing that whenever $X_1, \ldots, X_N$ are independent and satisfy (SR), the spectral norm estimate
\[
\mathbb{E}\, \|\Sigma_N - I\| \le \varepsilon \tag{4.1}
\]
follows from the spectral edge estimates
\[
\mathbb{E}\, \lambda_{\min}(\Sigma_N) \ge 1 - \varepsilon/3; \qquad \mathbb{E}\, \lambda_{\max}(\Sigma_N) \le 1 + \varepsilon/3 \tag{4.2}
\]
obtained in Theorems 1.5 and 3.1. The basic idea is to show using independence that
\[
\lambda_{\mathrm{average}}(\Sigma_N) = \frac{1}{n} \operatorname{tr}(\Sigma_N)
\]
is concentrated near its expectation of 1. Combining this with
\[
\mathbb{E}\, \big( \lambda_{\max}(\Sigma_N) - \lambda_{\min}(\Sigma_N) \big) \le 2\varepsilon/3,
\]
which follows immediately from (4.2), yields (4.1).

We rely on the following elementary proposition regarding sums of independent random variables.

Proposition 4.1. Let $Z_i$ be independent random variables with $\mathbb{E}\, Z_i = 1$ and satisfying the following tail bounds for some $C, \eta > 0$:
\[
\mathbb{P}\{|Z_i| > t\} \le C t^{-1-\eta}, \quad t > 0.
\]
If $\varepsilon \in (0,1)$ and
\[
N \ge \frac{(2C)^{2/\eta} (1 + 1/\eta)^{2/\eta}}{(\varepsilon/2)^{2 + 2/\eta}}
\]
then
\[
\mathbb{E}\, \Big| \frac{1}{N} \sum_{i=1}^N Z_i - 1 \Big| \le \varepsilon.
\]
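The sample size in Proposition 4.1 is what makes the truncation argument of its proof balance: with truncation level $K = (\varepsilon/2)\sqrt{N}$, both the bounded part and the tail part contribute at most $\varepsilon/2$. The sketch below checks this arithmetic under the assumption that the two parts are bounded by $K/\sqrt{N}$ and $2C(1 + 1/\eta)K^{-\eta}$ respectively, as in the proof; the function names are ours.

```python
import math


def prop41_sample_size(C, eta, eps):
    """Smallest N allowed by the hypothesis of Proposition 4.1."""
    return (2 * C) ** (2 / eta) * (1 + 1 / eta) ** (2 / eta) / (eps / 2) ** (2 + 2 / eta)


def truncation_bound(C, eta, N, K):
    """K/sqrt(N) + 2C(1 + 1/eta) K^(-eta): bounded part plus tail part."""
    return K / math.sqrt(N) + 2 * C * (1 + 1 / eta) * K ** (-eta)


C, eta, eps = 2.0, 1.0, 0.1
N = prop41_sample_size(C, eta, eps)
K = (eps / 2) * math.sqrt(N)
total = truncation_bound(C, eta, N, K)   # equals eps at the threshold N
```

At the threshold value of $N$ the two contributions are each exactly $\varepsilon/2$, and for larger $N$ the tail part only decreases.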

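Postponing the proof of Proposition 4.1, we use this fact to control
\[
\frac{1}{n} \operatorname{tr}(\Sigma_N) = \frac{1}{N} \sum_{i=1}^N \frac{\|X_i\|_2^2}{n}
\]
and prove the main theorem as follows.

Proof of Theorem 1.1. Assume the random vectors $X_i$ are isotropic and satisfy (SR) with parameters $C, \eta$. This implies that the random variables
\[
Z_i = \frac{\|X_i\|_2^2}{n}
\]
satisfy the requirements of Proposition 4.1 with parameters $C^{1+\eta}, \eta$. It follows that
\[
\mathbb{E}\, \Big| \frac{1}{n} \operatorname{tr}(\Sigma_N - I) \Big| = \mathbb{E}\, \Big| \frac{1}{N} \sum_{i=1}^N Z_i - 1 \Big| \le \varepsilon \tag{4.3}
\]
whenever
\[
N \ge \frac{(4C)^{2+2/\eta} (1 + 1/\eta)^{2/\eta}}{\varepsilon^{2+2/\eta}} =: \frac{C_{\mathrm{trace}}}{\varepsilon^{2+2/\eta}}. \tag{4.4}
\]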

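Now consider the random variables
\[
L = \lambda_{\min}(\Sigma_N - I), \qquad U = \lambda_{\max}(\Sigma_N - I), \qquad M = \frac{1}{n} \operatorname{tr}(\Sigma_N - I).
\]
We have
\[
L \le M \le U,
\]
and we are interested in
\[
\|\Sigma_N - I\| = U \vee (-L) \le U - L + |M|. \tag{4.5}
\]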
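The deterministic inequality (4.5) only involves the spectrum of $\Sigma_N$, so it can be illustrated directly on a list of eigenvalues. The helper below is a hypothetical sketch with made-up eigenvalues.

```python
def spectral_norm_terms(eigs):
    """Given the eigenvalues of Sigma_N, return (norm, U, L, M) for Sigma_N - I."""
    shifted = [lam - 1.0 for lam in eigs]
    U, L = max(shifted), min(shifted)
    M = sum(shifted) / len(shifted)      # normalized trace of Sigma_N - I
    return max(U, -L), U, L, M


norm, U, L, M = spectral_norm_terms([0.7, 1.0, 1.4])
rhs = U - L + abs(M)                     # right-hand side of (4.5)
```

Here `norm` is about $0.4$ while `rhs` is about $0.73$; the inequality is not tight, but it reduces the two-sided norm bound to the one-sided estimates on $U$, $L$ and the concentration of $M$.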

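When $N \ge C_{\mathrm{upper}}\, n/\varepsilon^{2+2/\eta}$, Theorem 3.1 gives $\mathbb{E}\, U \le \varepsilon$. To show that $\mathbb{E}\, L \ge -\varepsilon$, we recall that (SR) with parameters $C, \eta$ implies (WR) with parameters $C(2 + 2/\eta), \eta$ and invoke Theorem 1.5, noting that its requirement (1.8) is satisfied as
\[
C_{\mathrm{upper}} = 512\,(16C)^{1+2/\eta}(6 + 6/\eta)^{1+4/\eta} \ge 40\,(10C(2 + 2/\eta))^{2/\eta} = C_{\mathrm{lower}}.
\]
Now that we have both bounds $\mathbb{E}\, U \le \varepsilon$ and $\mathbb{E}\, L \ge -\varepsilon$, we can combine them with (4.3) and (4.5), which yields
\[
\mathbb{E}\, \|\Sigma_N - I\| \le 2\varepsilon + \varepsilon = 3\varepsilon
\]
whenever
\[
N \ge C_{\mathrm{upper}}\, \frac{n}{\varepsilon^{2+2/\eta}} \vee \frac{C_{\mathrm{trace}}}{\varepsilon^{2+2/\eta}}. \tag{4.6}
\]
We now replace $\varepsilon$ by $\varepsilon/3$ and note that taking
\[
N \ge C_{\mathrm{main}}\, \frac{n}{\varepsilon^{2+2/\eta}},
\]
where
\[
C_{\mathrm{main}} := 512 \cdot 3^{2+2/\eta} \cdot (16C)^{2+2/\eta}(6 + 6/\eta)^{1+4/\eta},
\]
always satisfies (4.6). This completes the proof of the theorem. □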



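Proof of Proposition 4.1. Fix a parameter $K > 0$, and decompose
\[
Z_i = Z_i\, \mathbf{1}_{\{|Z_i| \le K\}} + Z_i\, \mathbf{1}_{\{|Z_i| > K\}} =: Z_i' + Z_i''.
\]
Using $\mathbb{E}\, Z_i' + \mathbb{E}\, Z_i'' = \mathbb{E}\, Z_i = 1$ and by triangle inequality we obtain
\[
\mathbb{E}\, \Big| \frac{1}{N} \sum_{i=1}^N Z_i - 1 \Big| \le \mathbb{E}\, \Big| \frac{1}{N} \sum_{i=1}^N Z_i' - \mathbb{E}\, \frac{1}{N} \sum_{i=1}^N Z_i' \Big| + \mathbb{E}\, \Big| \frac{1}{N} \sum_{i=1}^N Z_i'' - \mathbb{E}\, \frac{1}{N} \sum_{i=1}^N Z_i'' \Big| =: E' + E''.
\]
By Jensen's inequality, independence and the bound on $Z_i'$, we have
\[
(E')^2 \le \operatorname{Var}\Big( \frac{1}{N} \sum_{i=1}^N Z_i' \Big) = \frac{1}{N^2} \sum_{i=1}^N \operatorname{Var}(Z_i') \le \frac{K^2}{N}.
\]
Moreover, by triangle and Jensen's inequalities
\[
E'' \le 2\, \mathbb{E}\, \Big| \frac{1}{N} \sum_{i=1}^N Z_i'' \Big| \le \frac{2}{N} \sum_{i=1}^N \mathbb{E}\, |Z_i''|.
\]
The assumption on the tails of $Z_i$ implies that $\mathbb{P}\{|Z_i''| > t\} \le C/(t \vee K)^{1+\eta}$ for $t > 0$, thus
\[
\mathbb{E}\, |Z_i''| = \int_0^\infty \mathbb{P}\{|Z_i''| > t\}\, dt \le C \Big( 1 + \frac{1}{\eta} \Big) K^{-\eta}.
\]
Hence
\[
E'' \le 2C \Big( 1 + \frac{1}{\eta} \Big) K^{-\eta}
\]
and
\[
E' + E'' \le \frac{K}{\sqrt{N}} + 2C \Big( 1 + \frac{1}{\eta} \Big) K^{-\eta}.
\]
Choosing $K = (\varepsilon/2)\sqrt{N}$ and using the assumption on $N$, one easily checks that
\[
E' + E'' \le \frac{\varepsilon}{2} + \frac{\varepsilon}{2} \le \varepsilon
\]
as desired. □

Appendix. Proof of Proposition 1.3

In this section we prove Proposition 1.3 which states that product distributions satisfy the regularity assumption in Theorem 1.1. Note that this result and its proof are not needed in the proof of Theorem 1.1.

Consider a random vector $X$ and an orthogonal projection $P$ in $\mathbb{R}^n$ as in Proposition 1.3. Denoting by $(P_{ij})$ the $n \times n$ matrix of the operator $P$, we express
\[
\|PX\|_2^2 = \langle X, PX \rangle = \sum_{i,j=1}^n \xi_i \xi_j P_{ij}.
\]
The contribution of the diagonal of $P$ to this sum is
\[
D := \sum_{i=1}^n \xi_i^2 P_{ii}.
\]
Denote by $P_0$ the matrix $P$ with diagonal removed; then
\[
\|PX\|_2^2 - D = \langle X, P_0 X \rangle. \tag{4.7}
\]
We can estimate $\langle X, P_0 X \rangle$ using a standard decoupling argument. Let $X'$ denote an independent copy of $X$, and let $\mathbb{E}_X$, $\mathbb{E}_{X'}$ denote the expectations with respect to $X$ and $X'$ respectively. Since the matrix $P_0$ has zero diagonal, we have$^2$
\[
\mathbb{E}\, |\langle X, P_0 X \rangle|^p \lesssim \mathbb{E}_{X'} \mathbb{E}_X\, |\langle X, P_0 X' \rangle|^p. \tag{4.8}
\]
This inequality can be obtained from general decoupling results, see [7, Theorem 3.1.1]; a simple and well known proof of (4.8) is given in [11].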

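Next, an application of a standard symmetrization argument and Khintchine inequality (or a direct application of Rosenthal's inequality [32], see [8]) yields for every $a \in \mathbb{R}^n$ that
\[
\mathbb{E}\, |\langle X, a \rangle|^p = \mathbb{E}\, \Big| \sum_{i=1}^n a_i \xi_i \Big|^p \lesssim \|a\|_2^p.
\]
Therefore, by conditioning on $X'$ we obtain from (4.8) that
\[
\mathbb{E}\, |\langle X, P_0 X \rangle|^p \lesssim \mathbb{E}_{X'}\, \|P_0 X'\|_2^p = \mathbb{E}\, \|P_0 X\|_2^p. \tag{4.9}
\]
Since $P_0$ equals $P$ without the diagonal, the triangle inequality yields
\[
\|P_0 X\|_2 \le \|PX\|_2 + \Big( \sum_{i=1}^n P_{ii}^2\, \xi_i^2 \Big)^{1/2}.
\]
Since $0 \le P_{ii} \le \|P\| \le 1$, we can replace $P_{ii}^2$ by $P_{ii}$, so
\[
\|P_0 X\|_2 \le \|PX\|_2 + D^{1/2} \lesssim \big( \|PX\|_2^2 + D \big)^{1/2}.
\]
Hölder's inequality then implies that
\[
\mathbb{E}\, \|P_0 X\|_2^p \lesssim \big( \mathbb{E}\, \big| \|PX\|_2^2 + D \big|^p \big)^{1/2}. \tag{4.10}
\]
Putting (4.7), (4.9) and (4.10) together, we arrive at the inequality
\[
\mathbb{E}\, \big| \|PX\|_2^2 - D \big|^p \lesssim \big( \mathbb{E}\, \big| \|PX\|_2^2 + D \big|^p \big)^{1/2}.
\]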



2 Throughout this proof, we write a < b if a < Cb for some constant C which is independent of 1 
Put in different words, the random variable $Z := \|PX\|_2^2 - D$ satisfies the inequality
\[
\|Z\|_{L_p}^2 \lesssim \|Z + 2D\|_{L_p} \le \|Z\|_{L_p} + 2\|D\|_{L_p}.
\]
Solving this quadratic inequality we obtain that
\[
\|Z\|_{L_p} \lesssim 1 + \|D\|_{L_p}^{1/2}. \tag{4.11}
\]
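The step of "solving the quadratic inequality" can be made explicit. Writing $z = \|Z\|_{L_p}$ and $d = \|D\|_{L_p}$, and letting $c$ stand for the implicit constant (an assumption of this sketch), the inequality reads $z^2 \le c(z + 2d)$. The hypothetical helper below returns the largest admissible $z$ and can be compared with the cruder bound $c + \sqrt{2cd}$.

```python
import math


def largest_z(c, d):
    """Largest z >= 0 satisfying z^2 <= c * (z + 2*d), for c > 0 and d >= 0."""
    return 0.5 * (c + math.sqrt(c * c + 8.0 * c * d))


c, d = 3.0, 50.0
z = largest_z(c, d)
crude = c + math.sqrt(2.0 * c * d)   # a bound of the form const * (1 + sqrt(d))
```

Since $\sqrt{c^2 + 8cd} \le c + 2\sqrt{2cd}$, the exact root never exceeds `crude`, which is a bound of the type (4.11).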

In order to bound $\|D\|_{L_p}$ we consider
\[
\|D - k\|_{L_p}^p = \mathbb{E}\, \Big| \sum_{i=1}^n P_{ii}\, \xi_i^2 - k \Big|^p = \mathbb{E}\, \Big| \sum_{i=1}^n P_{ii} (\xi_i^2 - 1) \Big|^p,
\]
where we used that $\sum_{i=1}^n P_{ii} = \operatorname{tr}(P) = k$. Recall that by the assumptions we have $\mathbb{E}(\xi_i^2 - 1) = 0$ and $\|\xi_i^2 - 1\|_{L_p} \le \|\xi_i^2\|_{L_p} + 1 = \|\xi_i\|_{L_{2p}}^2 + 1 \lesssim 1$. An application of Khintchine's inequality or Rosenthal's inequality (as before) and the bound $0 \le P_{ii} \le 1$ yield that
\[
\|D - k\|_{L_p}^p \lesssim \Big( \sum_{i=1}^n P_{ii}^2 \Big)^{p/2} \le \Big( \sum_{i=1}^n P_{ii} \Big)^{p/2} = (\operatorname{tr}(P))^{p/2} = k^{p/2}. \tag{4.12}
\]
It follows that
\[
\|D\|_{L_p} \le \|D - k\|_{L_p} + k \lesssim k^{1/2} + k \lesssim k.
\]
Putting this into (4.11), we see that
\[
\|Z\|_{L_p} \lesssim k^{1/2}. \tag{4.13}
\]
Finally, by definition of $Z$ and using the triangle inequality and bounds (4.13), (4.12), we conclude that
\[
\big\| \|PX\|_2^2 - k \big\|_{L_p} \le \|Z\|_{L_p} + \|D - k\|_{L_p} \lesssim k^{1/2} + k^{1/2} \lesssim k^{1/2}.
\]
Proposition 1.3 is proved. □